June 14th, 2006
Update: A lot has happened since the publication of this article. First of all, I have updated it with Hpricot and scRUBYt! examples. I then wrote the second part and hacked up a Ruby web-scraping toolkit, scRUBYt!, which also has a community web page - check it out, it’s hot right now!
Introduction
Despite the ongoing Web 2.0 buzz, the absolute majority of Web pages are still very Web 1.0: they heavily mix presentation with content. [1] This makes it hard or impossible for a computer to tell the wheat from the chaff: to sift out meaningful data from the rest of the elements used for formatting, spacing, decoration or site navigation.
To remedy this problem, some sites provide access to their content through APIs (typically via web services), but in practice this is currently limited to a few (big) sites, and some of them are not even free or public. In an ideal Web 2.0 world, where data sharing and site interoperability are basic principles, this should change soon(?) - but what should you do if you need the data NOW and not in the likely-to-happen future?
Manic Miner
The solution is called screen/Web scraping or Web extraction - mining Web data by observing the page structure and extracting the relevant records. In some cases the task is even more complex than that: the data can be scattered over several pages, a GET/POST request may have to be triggered to get the input page for the extraction, or authorization may be required to navigate to the page of interest. Ruby has solutions for these issues, too - we will take a look at them as well.
The extracted data can be used in any way you like - to create mashups (e.g. chicagocrime.org by Django author Adrian Holovaty), to remix and present the relevant data (e.g. rubystuff.com by ruby-doc.org maintainer James Britt), to automate processes (for example, if you have several bank accounts, to get the total of the money you have without using your browser), to monitor/compare prices/items, to meta-search, to create a semantic web page out of a regular one - just to name a few. The number of possibilities is limited only by your imagination.
Tools of the trade
In this section we will check out the two main possibilities (string and tree based wrappers) and take a look at HTree, REXML, RubyfulSoup and WWW::Mechanize based solutions.
String wrappers
The easiest (but in most cases inadequate) possibility is to view the HTML document as a string. In this case you can use regular expressions to mine the relevant data. For example, if you would like to extract the names of goods and their prices from a Web shop, and you know that both are in the same HTML element, like:
<td>Samsung SyncMasta 21''LCD $750.00</td>
you can extract this record from Ruby with this code snippet:
page.scan(/<td>(.*)\s+(\$\d+\.\d{2})<\/td>/)
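As a minimal, self-contained sketch of the snippet above, the pattern can be tried on an in-memory string. The `SAMPLE_PAGE` fragment, its second product row and the `extract_records` helper are made up for illustration, and the pattern uses a non-greedy `(.*?)` so that it still behaves if two cells end up on the same line:

```ruby
# Hypothetical page fragment; only the first row comes from the text above.
SAMPLE_PAGE = <<~HTML
  <tr><td>Samsung SyncMasta 21''LCD $750.00</td></tr>
  <tr><td>Acme Keyboard $19.99</td></tr>
HTML

# Returns an array of [name, price] pairs, one per matching table cell.
def extract_records(page)
  page.scan(/<td>(.*?)\s+(\$\d+\.\d{2})<\/td>/)
end

extract_records(SAMPLE_PAGE).each { |name, price| puts "#{name}: #{price}" }
```

Because the pattern has two capture groups, `scan` returns one two-element array per record, which is why the block can destructure it into `name` and `price`.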
Let’s see a real (although simple) example:
1 require 'open-uri'
2 url = "http://www.google.com/search?q=ruby"
3 open(url) {
4 |page| page_content = page.read()
5 links = page_content.scan(/<a class=l.*?href=\"(.*?)\"/).flatten
6 links.each {|link| puts link}
7 }
The first and crucial part of creating the wrapper program was the observation of the page source: We had to look for something that appears only in the result links. In this case this was the presence of the ‘class’ attribute, with value ‘l’. This task is usually not this easy, but for illustration purposes it serves well.
This minimalistic example shows the basic concepts: how to load the contents of a Web page into a string (line 4), and how to extract the result links from a Google search result page (line 5). After execution, the program lists the first 10 links of a Google search query for the word ‘ruby’ (line 6).
However, in practice you will mostly need to extract data which are not in a contiguous string, but contained in multiple HTML tags, or divided in a way where a string is not the proper structure for searching. In this case it is better to view the HTML document as a tree.[2]
Tree wrappers
The tree-based approach, although it enables more powerful techniques, has its problems, too: an HTML document can look very good in a browser, yet still be seriously malformed (unclosed/misused tags). It is a non-trivial problem to parse such a document into a structured format like XML, since XML parsers can work with well-formed documents only.
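To see the problem concretely, here is a small sketch using REXML, Ruby's XML library. The two markup strings and the `well_formed?` helper are invented for illustration: a browser would render both fragments, but a strict XML parser rejects the one with the unclosed tag.

```ruby
require 'rexml/document'

# Returns true if REXML can parse the markup as well-formed XML.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

# Hypothetical fragments: the second one never closes its <b> tag.
GOOD_MARKUP = '<p>Price: <b>$750.00</b></p>'
BAD_MARKUP  = '<p>Price: <b>$750.00</p>'

puts well_formed?(GOOD_MARKUP)
puts well_formed?(BAD_MARKUP)
```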
HTree and REXML
There is a solution (in most cases) for this problem, too: it is called HTree. This handy package is able to tidy up malformed HTML input, turning it into XML - the recent version is capable of transforming the input into the nicest possible XML from our point of view: a REXML Document. (REXML is Ruby’s standard XML/XPath processing library.)
After preprocessing the page content with HTree, you can unleash the full power of XPath, which is a very powerful XML document querying language, highly suitable for web extraction.
Refer to [3] for the installation instructions of HTree.
Let’s revisit the previous Google example:
1 require 'open-uri'
2 require 'htree'
3 require 'rexml/document'
4 url = "http://www.google.com/search?q=ruby"
5 open(url) {
6 |page| page_content = page.read()
7 doc = HTree(page_content).to_rexml
8 doc.root.each_element('//a[@class="l"]') {
|elem| puts elem.attribute('href').value }
9 }
HTree is used in the 7th line only - it converts the HTML page (loaded into the page_content variable on the previous line) into a REXML Document. The real magic happens in the 8th line. We select all the <a> tags which have an attribute ‘class’ with the value ‘l’, then for each such element write out the ‘href’ attribute. [4] I think this approach is much more natural for querying an XML document than a regular expression. The only drawback is that you have to learn a new language, XPath, which is (mainly from version 2.0) quite difficult to master. However, you do not need to know much of it to get started, and you gain lots of raw power compared to the possibilities offered by regular expressions.
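The XPath query can also be tried offline. The sketch below runs the same expression against a small, already well-formed fragment; the markup in `RESULT_PAGE` and the `result_links` helper are invented stand-ins for a result page, so neither HTree nor network access is needed:

```ruby
require 'rexml/document'

# Hypothetical, already well-formed stand-in for a search result page.
RESULT_PAGE = <<~XML
  <html><body>
    <a class="l" href="http://www.ruby-lang.org/">Ruby</a>
    <a class="nav" href="/next">Next</a>
    <a class="l" href="http://www.rubyonrails.org/">Rails</a>
  </body></html>
XML

# Collects the href of every <a> element whose class attribute is "l".
def result_links(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//a[@class="l"]').map { |a| a.attributes['href'] }
end

puts result_links(RESULT_PAGE)
```

The navigation link is skipped because its class attribute does not match, which is exactly the filtering the regular expression had to approximate by hand.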
Hpricot
Hpricot is “a Fast, Enjoyable HTML Parser for Ruby” by one of the coolest (Ruby) programmers of our century, why the lucky stiff. From my experience, the tag line is absolutely correct - Hpricot is both very fast (thanks to a C based scanner implementation) and really fun to use. It is based on HTree and jQuery, so it can provide the same functionality as the previous HTree + REXML combination, but with much better performance and greater ease of use. Let’s see the Google example again - I guess you will understand instantly what I mean!
require 'rubygems'
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('http://www.google.com/search?q=ruby'))
links = doc/"//a[@class=l]"
links.each {|link| puts link.attributes['href']}
Well, though this was slightly easier than with the tools seen so far, this example does not really show the power of Hpricot - there is much, much, much more in store: different kinds of parsing, CSS selectors and searches, nearly full XPath support, and lots of chunky bacon! If you are doing something smaller and don’t need the power of scRUBYt!, my advice is to definitely use Hpricot from the tools listed here. For more information, installation instructions, tutorials and documentation check out Hpricot’s homepage!
RubyfulSoup
RubyfulSoup is a very powerful Ruby screen-scraping package, which offers possibilities similar to HTree + XPath. For people who are not handy with XML/XPath, RubyfulSoup may be a wise compromise: it’s an all-in-one, effective HTML parsing and web scraping tool with Ruby-like syntax. Although its expressive power lags behind XPath 2.0, it should be adequate in 90% of the cases. If your problem is in the remaining 10%, you probably don’t need to read this tutorial anyway.
Installation instructions can be found here: [5].
The google example again:
1 require 'rubygems'
2 require 'rubyful_soup'
3 require 'open-uri'
4 url = "http://www.google.com/search?q=ruby"
5 open(url) {
6 |page| page_content = page.read()
7 soup = BeautifulSoup.new(page_content)
8 result = soup.find_all('a', :attrs => {'class' => 'l'})
9 result.each { |tag| puts tag['href'] }
10 }
As you can see, the difference between the HTree + REXML and RubyfulSoup examples is minimal - basically it is limited to differences in the querying syntax. On line 8, you look up all the <a> tags with the specified attribute list (in this case a hash with a single pair { ‘class’ => ‘l’ }). The other syntactical difference is looking up the value of the ‘href’ attribute on line 9.
I have found RubyfulSoup the ideal tool for screen scraping from a single page - however, web navigation (GET/POST, authentication, following links) is obscure at best with this tool, if possible at all (which is perfectly OK, since it does not aim to provide this functionality). However, there is nothing to fear - the next package does exactly that.
WWW::Mechanize
As of today, the vast majority of data resides in the deep Web - databases that are accessible via querying through web forms. For example, if you would like to get information on flights from New York to Chicago, you will (hopefully) not search for it on Google - you go to the website of Ruby Airlines instead, fill in the appropriate fields and click on search. The information which appears is not available on a static page - it’s looked up on demand and generated on the fly - so until the very moment the web server generates it for you, it’s practically non-existent (i.e. it resides in the deep Web) and hence impossible to extract. At this point WWW::Mechanize comes into play. (See [6] for installation instructions.)
WWW::Mechanize belongs to the family of screen scraping products (along with http-access2 and Watir) that are capable of driving a browser. Let’s apply the ‘Show, don’t tell’ mantra - for everybody’s delight and surprise, illustrated on our Google scenario:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.com')
search_form = page.forms.with.name("f").first
search_form.fields.name("q").first.value = "ruby"
search_results = agent.submit(search_form)
search_results.links.each {
|link| puts link.href if link.class_name == "l" }
I have to admit that I have been cheating with this one ;-). I had to hack WWW::Mechanize to access a custom attribute (in this case ‘class’) because normally this is not available. See how I did it here: [7]
This example illustrates a major difference between RubyfulSoup and Mechanize: in addition to screen scraping functionality, WWW::Mechanize is able to drive the web browser like a human user. It filled in the search form and clicked the ‘search’ button, navigating to the result page, then performed screen scraping on the results.
This example also pointed out the fact that RubyfulSoup - although lacking navigation possibilities - is much more powerful in screen scraping. For example, as of now, you cannot extract arbitrary (say <p>) tags with Mechanize, and as the example illustrated, attribute extraction is not possible either - not to mention more complex, XPath-like queries (e.g. the third <td> in the second <tr>), which are easy with RubyfulSoup/REXML. My recommendation is to combine these tools, as pointed out in the last section of this article.
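The positional query mentioned above (the third <td> in the second <tr>) can be sketched with REXML on its own. The table in `PRICE_TABLE` is made-up data, the helper name is hypothetical, and the path assumes the <tr> elements are direct children of <table>:

```ruby
require 'rexml/document'

# Hypothetical, well-formed table used only for this illustration.
PRICE_TABLE = <<~XML
  <table>
    <tr><td>Item</td><td>Stock</td><td>Price</td></tr>
    <tr><td>Monitor</td><td>12</td><td>$750.00</td></tr>
  </table>
XML

# XPath positions are 1-based: tr[2] is the second row, td[3] its third cell.
# Returns nil when the table has no such cell.
def third_cell_of_second_row(xml)
  doc = REXML::Document.new(xml)
  cell = REXML::XPath.first(doc, '/table/tr[2]/td[3]')
  cell && cell.text
end

puts third_cell_of_second_row(PRICE_TABLE)
```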
scRUBYt!
scRUBYt! is a simple to learn and use, yet very powerful web extraction framework written in Ruby, based on Hpricot and Mechanize. Well, yeah, I made it, so this is kind of a self promotion, but I think (hopefully not just because I am overly biased ;-)) it is the most powerful web extraction toolkit available to date. scRUBYt! can navigate through the Web (like clicking links, filling textfields, crawling to further pages - thanks to Mechanize), and extract, query, transform and save relevant data from the Web page of your interest through a concise and easy to use DSL (thanks to Hpricot and lots of smart heuristics).
OK, enough talking - let’s see it in action! I guess this is rather annoying now for the 6th time, but let’s revisit the Google example once more! (for the last time, I promise)
require 'rubygems'
require 'scrubyt'
google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
  result 'Ruby Programming Language' do
    link 'href', :type => :attribute
  end
end
google_data.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(google_data)
Output:
<root>
<result>
<link>http://www.ruby-lang.org/</link>
</result>
<result>
<link>http://www.ruby-lang.org/en/20020101.html</link>
</result>
<result>
<link>http://en.wikipedia.org/wiki/Ruby_programming_language</link>
</result>
<result>
<link>http://en.wikipedia.org/wiki/Ruby</link>
</result>
<result>
<link>http://www.rubyonrails.org/</link>
</result>
<result>
<link>http://www.rubycentral.com/</link>
</result>
<result>
<link>http://www.rubycentral.com/book/</link>
</result>
<result>
<link>http://www.w3.org/TR/ruby/</link>
</result>
<result>
<link>http://poignantguide.net/</link>
</result>
<result>
<link>http://www.zenspider.com/Languages/Ruby/QuickRef.html</link>
</result>
</root>
result extracted 10 instances.
link extracted 10 instances.
You can download this example from here.
Though the code snippet is not really shorter, maybe even longer than the other ones, there are a lot of things to note here. First of all, instead of loading the page directly (you can do that as well, of course), scRUBYt! allows you to navigate there by going to Google, filling in the appropriate text field and submitting the search. The next interesting thing is that you need no XPaths or other mechanism to query your data - you just copy’n’paste some examples from the page, and that’s it. Also, the whole description of the scraping process is more human friendly - you do not need to care about URLs, HTML, passing the document around, handling the result - everything is hidden from you and controlled by scRUBYt!’s DSL instead. You even get nice statistics on how much stuff was extracted.
The above example is just the tip of the iceberg - there is much, much, much more in scRUBYt! than what you have seen so far. If you would like to know more, check out the tutorials and other goodies on scRUBYt!’s homepage.
WATIR
From the WATIR page:
WATIR stands for “Web Application Testing in Ruby”. Watir drives the Internet Explorer browser the same way people do. It clicks links, fills in forms, presses buttons. Watir also checks results, such as whether expected text appears on the page.
Unfortunately I have no experience with WATIR, since I am a Linux-only nerd who uses Windows for occasional gaming but not for development, so I cannot say anything about it first hand. Judging from the mailing list contributions, though, I think Watir is more mature and feature-rich than Mechanize. Definitely check it out if you are running on Win32.
The silver bullet
For a complex scenario, usually an amalgam of the above tools can provide the ultimate solution: the combination of WWW::Mechanize or WATIR (for automation of site navigation), RubyfulSoup (for serious screen scraping, where the above two are not enough) and HTree+REXML (for extreme cases where even RubyfulSoup can’t help you).
I have been creating industry-strength, robust and effective screen scraping solutions for the last five years of my career, and I can show you a handful of pages where even the most sophisticated solutions do not work (and I am not talking about scraping with RubyfulSoup here, but about even more powerful solutions, like embedding Mozilla in your application and directly accessing the DOM). So the basic rule is: there is no spoon (err… silver bullet) - and I know from experience that the number of ‘hard-to-scrape’ sites is rising (partially because of Web 2.0 stuff like AJAX, but also because some people do not want their sites to be extracted and apply various anti-scraping masquerading techniques).
The described tools should be enough to get you started - additionally, you may have to figure out how to drill down to your stuff on the concrete page of interest.
In the next installment of this series, I will create a mashup application using the introduced tools, from some more interesting data than Google. The results will be presented on a Ruby on Rails powered page, in a sortable AJAX table.
attr_reader :class_name
Into the constructor:
@class_name = node.attributes['class']
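Ruby's open classes offer an alternative to editing the library source: the class can be reopened and the reader added from your own script. The sketch below uses stand-in `SketchNode` and `Link` classes (hypothetical, not the real WWW::Mechanize ones) so that it runs on its own:

```ruby
# Stand-ins for the library's classes; names and structure are assumptions.
SketchNode = Struct.new(:attributes)

class Link
  attr_reader :node

  def initialize(node)
    @node = node
  end

  def href
    node.attributes['href']
  end
end

# Reopen the existing class and add the missing reader; the original
# definition above is left untouched.
class Link
  def class_name
    node.attributes['class']
  end
end

link = Link.new(SketchNode.new({ 'href' => 'http://www.ruby-lang.org/', 'class' => 'l' }))
puts link.href if link.class_name == 'l'
```

The second `class Link` block does not replace the first one; Ruby merges the new method into the class that already exists, so `href` keeps working.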
June 14th, 2006 at 2:56 am
[...] http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails [...]
June 14th, 2006 at 11:28 am
There is a project to port Watir for Firefox, just FYI - it’s called FireWatir
http://wiki.mozilla.org/SoftwareTesting:WatirandFirefox
June 14th, 2006 at 11:30 am
those were supposed to be underscores around the and, between Watir and Firefox, in the url in my above comment - I don’t know how those got altered - sorry.
June 14th, 2006 at 11:35 am
Chris,
Thanks for the link! We are developing a screen scraping application just now which is a Firefox extension, so i am quite involved with Firefox and good to know about stuff like FireWatir.
About the underscores - I guess it is WordPress. For example this was written as asterisk-this-asterisk and now you can see it in bold. Probably underscore is a shortcut for italic, I guess…
June 14th, 2006 at 12:30 pm
I am not commenting on your blog because you had a captcha-esque ‘please add 10 and 0′ field, it derided me as ‘not knowning math’ when I entered “10″, and it eradicated all the contents of my post rather than letting me take the challenge again. Comment spam is annoying, but my time is better spent bitching about your way of handling it than actually rewriting my post and helping you out.
June 14th, 2006 at 2:00 pm
@aa:
June 14th, 2006 at 2:29 pm
OK, I have turned off the captcha until i find something more convenient… So if the comments will be full of spam its because of that
June 14th, 2006 at 5:45 pm
With WWW::Mechanize you can get the parsed rexml document and it also adds convenience methods to this REXML::Document
agent = WWW::Mechanize.new
page = agent.get('http://www.google.com')
form = page.forms.first
form.fields.name('q').value = 'ruby'
searchresults = agent.submit(form)
searchresults.root.each_element('//a[@class="l"]') {|elem| puts elem.attribute('href').value }
June 14th, 2006 at 5:54 pm
Absolutely superb article. I’ve generally always put up with using old fashioned regexp in my screen scraping and didn’t know of these other methods until now. You’ve opened my eyes. Thank you.
June 14th, 2006 at 10:04 pm
[...] Screen Scraping With Ruby - a tutorial. [...]
June 15th, 2006 at 4:50 am
Super neat. I’m expecting a new one, specially using Gecko’s DOM.
June 15th, 2006 at 5:00 am
Great introduction. You might want to add an link to your main page in the posts. I tried clicking the header but there seems to be no link. Now I will have to go back to Reddit to find out your blog’s url (I’m using sharpreader).
June 15th, 2006 at 5:40 am
@Leonardo:
I have a fully working and tested Java solution for that - but there i have every building stone ( Java gecko widget - currently using SWT.Browser but there are alternatives like Ajax Toolkit Framework and XULRunner which are even better) and JavaXPCOM + W3CConnector to communicate between mozilla and java)
The problem with Ruby is that although both of these things are there (RubyGecko, GTK::Mozembed and rbXPCOM), they are in a very, very immature state, and I am not sure they are even usable. So although I have all the know-how to build such a thing, I am not sure whether the building blocks allow me to do this.
@frank:
Thanks for the suggestion! I will do that ASAP.
June 15th, 2006 at 7:49 pm
Do you published the solution’s source code? Maybe I can help…
June 16th, 2006 at 12:54 am
Font size…
Gee whats with the small font size on this page, the code blocks are unreadable unless font size is increased by the browser
June 16th, 2006 at 1:08 am
Since more people have been complaining about the font size/line height, I have modified it a bit for both the text and the source code. Thanks for the feedback, I am continuously trying to improve the look, so suggestions are welcome!
June 16th, 2006 at 1:11 am
@Leonardo:
Could you please PM me? I’d prefer to talk this over via e-mail rather than a WordPress comment page
June 16th, 2006 at 1:26 pm
I’ve been working on a detailed project to parse and quantify a complicated course listing website for my college. Unfortunately, the site is a HTML throwback to the early 90’s and does not differentiate between listings in any meaningful way. As a result, the only thing capable of parsing the sea of random tags is a set of carefully constructed regex’s. This is would break very easily if they ever bothered to change how they did their markup, but it works in this case.
As I work on this, I’m constructing a parsing toolkit designed to abstract some of the repetitive regex tasks I frequently go through. While gross overkill for a nicely formatted site, it’s the only thing that seems to work with this html eyesore.
June 18th, 2006 at 11:35 pm
[...] Peter Szinek, owner of RubyRailWays, has announced a serie of articles about screen-scraping subjects. The first article «Data extraction for Web 2.0: Screen scraping in Ruby/Rails» was recently published. [...]
June 19th, 2006 at 12:48 pm
I would like to talk to you… I have been in a company that commericialized the first two methods you speak of - HTree and REXML. With a GUI designer.
I have a few thoughts about commercial applications… that could be monetized.
June 19th, 2006 at 12:55 pm
@Noel:
You can reach me at peter@[thissite].com. Feel free to send me an email!
June 19th, 2006 at 5:54 pm
Thanks for the techniques listed here.
I’m going to go make a few screens my bitch using these techniques.
June 19th, 2006 at 6:02 pm
Thanks for the mention of rubystuff.com. That site is itself created by scraping content from CafePress, using WWW::Mechanize.
Shamless plug; I wrote about that here: http://neurogami.com/cafe-fetcher/
June 20th, 2006 at 4:59 pm
HTree or HTML::XMLParser
It seems HTML::XMLParser is already included in ruby (in either net/http or mechanize or rexml ?) is already included and does pretty much the same thing as HTree without an extra download. Any reason you prefer HTree?
June 21st, 2006 at 11:20 pm
[...] Ruby, Rails, Web2.0 » Data extraction for Web 2.0: Screen scraping in Ruby/Rails “In this section we will check out the two main possibilities (string and tree based wrappers) and take a look at HTree, REXML, RubyfulSoup and WWW::Mechanize based solutions.” (tags: scraping) [...]
June 22nd, 2006 at 4:02 am
@RMX:
Well, the reason for this is very prosaic: I did not know HTML::XMLParser beforehand.
I will check it out and see what the difference is between HTree and XMLParser…
June 22nd, 2006 at 2:14 pm
[...] Peter Szinek studied the different options for screen scraping/Web extraction/automated Web navigation with Ruby, and produced an article comparing the various Ruby libraries usable in this field. [...]
June 25th, 2006 at 9:08 am
[...] Sometimes it feels a bit backwards scraping sites for microformats, maybe there’s scope for microformat returning webservices in the future. For the time being, if you’re wanting to parse sites in ruby there are several tools. I began by using the HTML lib which is used by assert_tag and friends in rails, but then ran into problems when giving it malformed XHTML. Now I’ve ended up with RubyfulSoup which is doing the job nicely. Other options are covered in this article. [...]
June 30th, 2006 at 1:32 pm
Data extraction for Web 2.0: Screen scraping in Ruby/Rails…
introduction to screen scraping/Web extraction with Ruby, evaluation of the tools along with installation instructions and examples….
July 12th, 2006 at 7:40 am
[...] The indefatigable Assaf Arkin has done it again by developing a new Ruby HTML scraping toolkit, scrAPI. Peter Szinek recently wrote a popular article about scraping from Ruby using Manic Miner, RubyfulSoup, REXML, and WWW::Mechanize, but none of these are as immediately useful as scrAPI.. so why? [...]
July 20th, 2006 at 9:09 pm
I am pretty new to this web scraping stuff…can anyone tell me what are the major business usecases for this scraping? i know this web20 mashup’s does this but any commercial application does this?
tia.
August 5th, 2006 at 8:58 pm
Uday — There are a few different business cases I can think of. A primary one is marketing where you might want to build a contact list for your sales force to call or other sorts of targeting. There are many databases online that contain a lot of useful information.
Other times maybe you are trying to automate a process you have to do often. I saw an author who use a technique like this to track sales. There are other examples like tracking ebay bids on certain items that a power seller might find useful. There are many times where you want to take data from a web page and turn it into structured data for your own purposes.
August 12th, 2006 at 9:31 am
Very helpfull and interesting article. Wanted to ask your opinion on scrAPI aswell. Looking forward to your next article on this subject.
August 20th, 2006 at 8:36 am
Hi,
When I copy paste your HTree example it gives error:
undefined method `HTree’ for main:Object (NoMethodError)
on the usage of the HTree class. The following seems to work fine:
require 'open-uri'
require 'htree/parse'
require 'htree/rexml'
require 'rexml/document'
url = "http://www.google.com/search?q=ruby"
open(url) {
|page| page_content = page.read()
doc = HTree.parse(page_content).to_rexml
doc.root.each_element('//a[@class="l"]') {
|elem| puts elem.attribute('href').value }
}
August 20th, 2006 at 8:42 am
Rubyful seems to change utf-8 characters, for instance into %nbsp Is this standard behaviour?
August 20th, 2006 at 8:57 am
Duh sorry about that, I meant to say the &nbsp; is translated into %nbsp
September 3rd, 2006 at 2:41 am
Nice post…
September 7th, 2006 at 1:18 am
[...] Clicking through the same web forms umpteen times. Selecting Hamburg as the federal state umpteen times and picking the preferred districts with the CTRL key held down. And every time a pop-under opens along with it. I have had enough! Motivated by a blog entry on Rubyrailways by Peter Szinek from beautiful Vienna (küss die Hand), I took a closer look at the Mechanize module by Michael Neumann and Aaron Patterson. Basically it simulates a web browser and can be [...]
September 20th, 2006 at 3:28 pm
HTML::XMLParser?
It took me awhile, but I figured out what RMX was talking about. For the curious:
gem install htmltools
then
require_gem 'htmltools'
require 'html/xmltree.rb'
parser = HTMLTree::XMLParser.new(false, false)
parser.parse_file_named('my.html')
doc = parser.document # is a REXML::Document
Check out lib/html/xmltree.rb at http://ruby-htmltools.rubyforge.org/doc/ for more info. Seems to be functionally identical to htree. Slightly easier to install, but also in my very limited testing almost twice as slow.
-chuck
September 26th, 2006 at 11:35 pm
Thanks for the information, I needed a pick me up.
October 3rd, 2006 at 11:15 pm
Hey ‘RMX’: I don’t see any HTML::XMLParser in the standard distribution. You would think before sending people on wild goose chases looking in three different places you say it might be (one of which, Mechanize, isn’t even there either) you would be a little more sure of it yourself. Check your facts next time.
October 4th, 2006 at 12:15 am
@Mark:
Don’t worry about RMX’s tips
There is a better (by far) solution already: Hpricot by why. I am working on my Ruby web-extraction framework right now - using Hpricot - and I can tell you, it is absolutely the way to go. It is waaaay faster than any other tool, and they say it also has better shaky-HTML-parsing capabilities. Well, so far I have not had problems with any page, and it is really, really lightning fast compared to HTree + REXML or RubyfulSoup.
October 6th, 2006 at 9:37 pm
Ok Peter I’ll check that out. I’ll also look for your web abstraction framework.
December 17th, 2006 at 7:06 pm
yes, I agree with Bob. rubyfulsoup seems to translate html entity references like “&nbsp;” and “&eacute;” into “%nbsp” and “%eacute” respectively. Of course, the problem might be in the SGML parser code that rubyful soup uses. It sure would be nice if the community could discuss this problem and its solutions in more detail.
Cs
February 6th, 2007 at 7:53 pm
[...] Article 1 [...]
February 23rd, 2007 at 3:49 pm
Instead of altering the gem, you could just add this at the top of your example :
class WWW::Mechanize::Link
def class_name
node.attributes['class']
end
end
February 23rd, 2007 at 4:02 pm
Yeah, that’s absolutely true and it’s definitely the Ruby way - unfortunately when I wrote this article I was totally new to Ruby and (coming from Java) I forgot about the possibility to reopen a class…
March 6th, 2007 at 8:16 am
you could have a look on http://www.knowlesys.com, they provide web data extraction service.
March 9th, 2007 at 12:29 pm
[...] Various tools for screen scrapping Filed under: Uncategorized — bngu @ 11:29 am I came across this article that discussed several tools for screen scraping. The tools mentioned are string wrappers and tree wrappers. String wrapper is basic and not very flexible. Tree wrappers have several options: HTree, Hpricot, RubyfulSoup, WWW::Mechanize, scRUBYt!, WATIR. For examples and in-depth discussion of each of the tool, check out the article. [...]
April 3rd, 2007 at 12:08 am
[...] scRUBYt! is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. The idea behind making scRUBYt! was to show a few simple concepts of Web extraction as a practical extension of this tutorial. [...]
April 5th, 2007 at 7:31 pm
Ruby Bikini - How to Process XML in Ruby…
Continuing in the series of Brazilian bikini Web development tutorials, here is an experiment with the Yahoo Search API, Ruby and Brazilian bikinis….
April 16th, 2007 at 3:13 pm
I would like to use this in conjunction with trying to send a website URL to a validator at: http://validator.w3.org/, I have been reading your articles on Information Acquisition Process, tutorials, and what not. I have installed everything with no issues, and I’m just wondering where do I start, you have these examples but what do I do with it, does it go in a controller that I have made say Validator_controller? Could you possible guide me through this as I don’t really have a clue.
What I’m trying to do is have send a website URL to a validator like the one above, and then grab all the validation results etc, and display it on a page in my web application. Any help would be greatly received, oh I signed up to your forum, but but I never received my activation email? I checked my email and it is correct, my login was solidariti, if you want to check.
Thank you
May 2nd, 2007 at 8:27 am
[...] 2)http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails This is a brief overview of scraping methods in Ruby. The author is a wee bit biased (but very knowledgeable) towards his own scraper-class: ScrubyT. I have not used ScrubyT since I am on a WIN32 machine and it wont work for me without some major tweaking. But he also goes over Hpricot, and Mechanize, which I use extensively. [...]
May 30th, 2007 at 12:43 am
Hi —
I am new to this scrapping technology. I was recently assigned a project which needs certain information to be scrapped from multiple webpages. Presently they are doing it using Perl:LWP and RegEX on Win32. As there is no option for a commerical software, please let me know your views and recommendations on any solutions that would address the need. Is PERL:LWP module sufficient enough ? or should I look for any .NET modules ?
Thanks
July 5th, 2007 at 5:07 pm
[...] Several options are available, but oh so popular is why’s Hpricot. It’s fast and enjoyable (although I experienced no joy while learning how to use it =) It also happens to be used in some of the other scraping/navigating libraries (WWW::Mechanize [rdoc] and scRUBYt!). [...]
July 9th, 2007 at 9:01 am
I’ve been using Newbie Web Automation http://www.newbielabs.com and it does a pretty good job of scraping data from websites. It supports IE and Firefox. I’m interested to see how this Ruby data extraction tool would stack up.
Does it come with a debugger?
July 12th, 2007 at 3:56 am
[...] Even though the Web 2.0 wave is sloshing through the net with plenty of noise, there are many websites, especially in the German-speaking net, that are still sitting completely high and dry. Many site operators have not yet heard of open, remixable data, e.g. in the form of a web service, or they resist the idea. Yet with screen or Web scraping, nearly all content can be used for mashups. At rubyrailways.com, various interesting approaches and libraries for Ruby programmers are presented and compared, including their pros and cons. [...]
August 9th, 2007 at 5:24 am
Hi: This is a first class tutorial–very professionally presented. It was also very useful; i read every word. I thought i was at least a competent practitioner of this skill, but apparently i’m not! Additionally, whether it was your intention or not, i think that reading this article helps anyone who hopes to acquire fluency w/ scRUBYt!, by providing the context, or the problems w/ current libraries and techniques that led to scRUBYt!’s development. After discovering it a few days ago, i’ve used scRUBYt! several times on real problems on a professional project i’ve been working on for the past six months–scRUBYt! worked as smoothly as a commercial app, no hitches. So i didn’t find any bugs, and i doubt i could offer any improvements that you or the Community haven’t thought of already, but if that changes, i’ll post up. regards –doug
August 23rd, 2007 at 9:48 am
I have written a JavaScript tool that is extremely efficient for web scraping. Check it out: http://www.feedmarklet.com/batchmarklet.html
September 6th, 2007 at 7:48 am
Hpricot can fail if the HTML has errors. In that case you could clean it up with Tidy first, like this:
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'tidy'
Tidy.path = '/usr/lib/libtidy.so' # assumed location of the Tidy library; adjust for your system
agent = WWW::Mechanize.new
page = agent.get("http address")
html = page.body # raw HTML source of the fetched page
# Tidy up the HTML, as it contains errors
xml = Tidy.open(:show_warnings => true) { |tidy|
  tidy.options.output_xml = true
  puts tidy.options.show_warnings
  xml = tidy.clean(html)
  # puts tidy.errors
  # puts tidy.diagnostics
  xml
}
# Convert to an Hpricot document
doc = Hpricot(xml)
# do the rest of the HTML processing
September 16th, 2007 at 11:39 pm
Hi
Very interesting information! Thanks!
Bye
September 25th, 2007 at 3:54 am
Great tutorial!
Just two comments: the first example does not return any results, I think it’s because Google now returns the “class” part after the “href”.
And on the last example, the last line throws that error:
scraping006.rb:15: uninitialized constant Scrubyt::ResultDumper (NameError)
I’m on Ubuntu 7, ruby 1.8.5
September 25th, 2007 at 4:17 am
Jaime,
Yeah, the first problem is a classic for web scraping: if the source changes, your scraper stops working. There are several solutions for this problem (starting with the most primitive, recoding your scraper, up to sophisticated AI heuristics including scraper adaptation, machine learning etc). Thanks for noting it though, I’ll update it soon.
As for the second problem: that’s fine - ResultDumper was dropped due to a rewrite and should be back in the future. However, it’s nothing big, it just showed some statistics of the results (like the link pattern matched 10 results etc). You can ignore it for now.
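One way to soften the attribute-order problem described above is to read each tag's attributes into a hash before matching, instead of relying on a single regular expression that expects `class` before `href`. The following is only a minimal sketch: it assumes double-quoted attribute values, and both `links_with_class` and the sample markup are hypothetical, not code from the article.

```ruby
# Hypothetical sample markup: the same kind of link with its
# attributes written in both orders, plus one link without a class.
SAMPLE_HTML = <<-HTML
  <a class="l" href="http://example.com/one">One</a>
  <a href="http://example.com/two" class="l">Two</a>
  <a href="http://example.com/other">Other</a>
HTML

# Return the href of every <a> tag carrying the given class,
# whatever order the attributes appear in.
def links_with_class(html, css_class)
  html.scan(/<a\b[^>]*>/i).map do |tag|
    # Collect the tag's attributes into a hash, so order no longer matters.
    attrs = {}
    tag.scan(/([\w-]+)\s*=\s*"([^"]*)"/) { |name, value| attrs[name.downcase] = value }
    attrs["href"] if attrs["class"].to_s.split.include?(css_class)
  end.compact
end

puts links_with_class(SAMPLE_HTML, "l")
```

This still breaks if the site renames the class or restructures the page; it only removes the dependency on attribute order.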
March 2nd, 2008 at 8:00 pm
I landed on this site the other day while searching for screen scraping. I wanted to write a screen scraper to monitor the status page of my DSL modem because AT&T service has been exceptionally poor lately, and I felt I might gather some valuable or at least interesting information by logging the status for a few days. So, I tried each example; some worked, some didn’t. The WWW::Mechanize example worked and returned search results from http://www.google.com. Cool, not quite what I wanted, but cool. I only ran it twice: once exactly as it was on the page, and a second time with a different search value. Then I moved on with the Ruby learning and finally completed my modem status page scraper, which coincidentally was my first Ruby program. Now, Google has put my IP address on some sort of blacklist. I cannot conduct a search without first solving a CAPTCHA; then, after they’ve updated a cookie in my browser, I’m good to go. If I clear my browser’s cookies, I get the Google Error page and again must enter the CAPTCHA. If I go through a proxy, no CAPTCHA. If anyone else has encountered this problem, do share. I don’t think it is a coincidence. I do hope my IP is erased from this supposed blacklist soon. It is such an annoyance.
March 22nd, 2008 at 7:29 am
Thanks for your very well conceived and executed tutorial. I particularly appreciated your putting the various tools into context so that as a beginner I can make an informed decision about which to invest the time in learning.
I think that Ruby would be more widely and effectively used if there were more tutorials providing this kind of detailed and substantive overview of various problem domains.
Thanks again for this most helpful tutorial.
May 10th, 2008 at 5:11 pm
I’m not sure if this is the forum for my question. I’m new to Ruby. I’m looking for ways to submit web forms and save the resulting web page in a pdf file.
So I go to https://www.some-site.com (yes https); I click on a “start” button and a new page with a form to fill is displayed; I fill in the form using data from a csv file; I click on a button to submit the form; a new confirmation page is displayed. I want to save this confirmation page in a pdf file.
I want to do this in Windows using simple Ruby scripts (without AJAX or RAILS or VB, etc.) Using just Ruby scripts (I think I used Watir too) and IE 6, I am able to do the form submit and navigation. However, I can’t seem to find a simple way to save the last page into a pdf file.
I tried using the IE “print” function (CTRL-P) but I can only get the IE print dialog to come up; I don’t know how to supply the file name for the pdf printer’s “save as” pop-up window. Any ideas?
Thanks.
John
September 26th, 2008 at 3:02 pm
Cool
http://www.tuvinh.com
October 20th, 2008 at 11:56 am
[...] world of screen-scraping as it is called, doesn’t end there. If you need more advanced techniques for screen scraping a page, behold the power of the [...]
December 10th, 2008 at 11:34 am
Very good Article.
http://scrappingexpert.com
Web Data extraction Specialist.
December 12th, 2008 at 10:05 am
I admire you on the willingness to share this info with others - good luck!
February 10th, 2009 at 4:11 am
Great article,
but does it work fine with sites that use JavaScript?
February 10th, 2009 at 4:28 am
@Dlip: Sure, scRUBYt! does. Check out http://scrubyt.org.
November 4th, 2009 at 3:51 am
I am also writing a new article about extracting data from web pages. Thanks for providing the basis with your article; I will link to it.
Eric
January 26th, 2010 at 9:54 pm
The information presented is top notch. I’ve been doing some research on the topic and this post answered several questions.
June 20th, 2010 at 8:26 am
Hi, I found your site by googling for Manic Miner. Have you seen the cool clothes at manicminer.se