Data extraction for Web 2.0: Screen scraping in Ruby/Rails

Update: A lot has happened since this article was first published. I have updated it with Hpricot and scRUBYt! examples, written a second part, and hacked up a Ruby web-scraping toolkit, scRUBYt!, which also has a community web page. Check it out, it’s hot right now!

Introduction

Despite the ongoing Web 2.0 buzz, the vast majority of Web pages
are still very Web 1.0: they heavily mix presentation with content. [1]
This makes it hard or impossible for a computer to separate
the wheat from the chaff: to sift out the meaningful data from the elements
used for formatting, spacing, decoration or site navigation.

To remedy this problem, some sites provide access to their content
through APIs (typically via web services), but in practice this is still
limited to a few (big) sites, and some of those APIs are not even free or public.
In an ideal Web 2.0 world, where data sharing and site interoperability are among
the basic principles, this should change soon(?). But what should
you do if you need the data NOW and not in some likely-to-happen future?

Manic Miner

The solution is called screen scraping, Web scraping or Web extraction: mining Web data
by observing the page structure and extracting the relevant records. In some
cases the task is even more complex than that: the data can be scattered over
several pages, a GET/POST request may have to be triggered to get the input page
for the extraction, or authorization may be required to navigate to the page of
interest. Ruby has solutions for these issues, too, and we will take a look at them
as well.

The extracted data can be used in any way you like: to create mashups
(e.g. chicagocrime.org by Django author
Adrian Holovaty), to remix and present the relevant data
(e.g. rubystuff.com by
ruby-doc.org
maintainer James Britt), to automate
processes (for example, if you have several bank accounts, to get the total of the
money you have, without using your browser), to monitor/compare
prices/items, to do meta-search, or to create a semantic web page out of a regular one,
just to name a few. The number of possibilities is limited only by your
imagination.

Tools of the trade

In this section we will check out the two main approaches (string-based and tree-based
wrappers) and take a look at solutions based on HTree, REXML, Hpricot, RubyfulSoup,
WWW::Mechanize, scRUBYt! and WATIR.

String wrappers

The easiest (but in most cases inadequate) approach is to view the
HTML document as a string. In this case you can use regular expressions to
mine the relevant data. For example, if you would like to extract the names
of goods and their prices from a Web shop, and you know that they are
both in the same HTML element, like:

<td>Samsung SyncMasta 21''LCD		$750.00</td>

you can extract this record in Ruby with this code snippet (assuming the page source is in the string page):

page.scan(/<td>(.*?)\s+(\$\d+\.\d{2})<\/td>/)
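
To try the idea without fetching a page, here is a minimal sketch that runs such a pattern on the sample row above. The variable names are made up for illustration, and the name group is written as non-greedy so that the whitespace before the price is not swallowed into the name.

```ruby
# Sample row copied from the text; a real page would be fetched instead.
page = "<td>Samsung SyncMasta 21''LCD\t\t$750.00</td>"

# Non-greedy name, one or more whitespace characters, then a dollar price.
records = page.scan(/<td>(.*?)\s+(\$\d+\.\d{2})<\/td>/)

# Each match is an array with one entry per capture group.
records.each do |name, price|
  puts "#{name} costs #{price}"
end
```

Because the pattern has two capture groups, scan returns an array of [name, price] pairs, one per matching cell.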

Let’s see a real (although simple) example:

1 require 'open-uri'

2 url = "http://www.google.com/search?q=ruby"
3 URI.open(url) do |page|
4   page_content = page.read
5   links = page_content.scan(/<a class=l.*?href=\"(.*?)\"/).flatten
6   links.each { |link| puts link }
7 end

The first and crucial part of creating the wrapper program was the observation of the
page source: We had to look for something that appears only in the result links.
In this case this was the presence of the ‘class’ attribute, with value ‘l’. This
task is usually not this easy, but for illustration purposes it serves well.

This minimalistic example shows the basic concepts: how to load the
contents of a Web page into a string (line 4), and how to extract the result
links from a Google search result page (line 5). When executed, the program
lists the first 10 links of a Google search for the word ‘ruby’ (line 6).
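
Search result markup changes over time, so the script may no longer find anything on the live page. The same extraction can be tried offline on a hand-written snippet. The HTML below is an assumption that merely imitates the markup described above; it is not Google’s real output.

```ruby
# Hypothetical markup imitating result links that carry class=l.
page_content = <<~HTML
  <a class=l href="http://www.ruby-lang.org/">Ruby</a>
  <a href="/advertising">Ads</a>
  <a class=l href="http://www.rubyonrails.org/">Rails</a>
HTML

# scan returns one array per match (one entry per capture group);
# flatten turns [[url], [url]] into [url, url].
links = page_content.scan(/<a class=l.*?href=\"(.*?)\"/).flatten
links.each { |link| puts link }
```

Only the two links with class=l are picked up; the link without the attribute is skipped.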

However, in practice you will mostly need to extract data which are not
in a contiguous string, but contained in multiple HTML tags, or divided
in a way where a string is not the proper structure for searching. In
this case it is better to view the HTML document as a tree.[2]

Tree wrappers

The tree-based approach, although it enables more powerful techniques,
has its problems, too: an HTML document can look very good in a browser,
yet still be seriously malformed (unclosed/misused tags). It is a
non-trivial problem to parse such a document into a structured format
like XML, since XML parsers can work with well-formed documents only.
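
A small sketch of the problem, using only REXML: the helper well_formed? is a hypothetical name introduced here for illustration, and the two snippets are made-up examples. A browser renders both the same way, but a strict XML parser accepts only the first.

```ruby
require 'rexml/document'

# Returns true if REXML can parse the string as XML, false otherwise.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

tidy  = '<p>one<br/>two</p>'
messy = '<p>one<br>two</p>' # <br> is never closed: fine in HTML, an error in XML

puts well_formed?(tidy)
puts well_formed?(messy)
```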

HTree and REXML

There is a solution (in most cases) for this problem, too:
it is called HTree. This handy package is able
to tidy up the malformed HTML input, turning it into XML. The recent version is
capable of transforming the input into the nicest possible XML from our point of view: a REXML
Document. (REXML is Ruby’s standard XML/XPath processing library.)

After preprocessing the page content with HTree, you can unleash the
full power of XPath, which is a very powerful XML document querying language,
highly suitable for web extraction.

Refer to [3] for the installation instructions of HTree.

Let’s revisit the previous Google example:

1 require 'open-uri'
2 require 'htree'
3 require 'rexml/document'

4 url = "http://www.google.com/search?q=ruby"
5 URI.open(url) do |page|
6   page_content = page.read
7   doc = HTree(page_content).to_rexml
8   doc.root.each_element('//a[@class="l"]') { |elem| puts elem.attribute('href').value }
9 end

HTree is used in the 7th line only: it converts the HTML page (loaded into the page_content
variable on the previous line) into a REXML Document. The real magic happens
in the 8th line. We select all the <a> tags which have a ‘class’ attribute with the
value ‘l’, then write out the ‘href’ attribute of each such element. [4]
I think this approach is much more natural for querying an XML document than a regular expression.
The only drawback is that you have to learn a new language, XPath, which is (mainly from
version 2.0) quite difficult to master. However, to get started you do not need to know
much of it, yet you gain lots of raw power compared to the possibilities offered by regular expressions.
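
The XPath part can be tried on its own, without HTree or a network connection. This is a minimal sketch on a hand-written, already well-formed snippet; the markup is an assumption made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a tidied result page.
xml = <<~XML
  <html><body>
    <a class="l" href="http://www.ruby-lang.org/">Ruby</a>
    <a href="/advertising">Ads</a>
    <a class="l" href="http://www.rubyonrails.org/">Rails</a>
  </body></html>
XML

doc = REXML::Document.new(xml)

# Select every <a> whose class attribute equals "l", then read its href.
hrefs = REXML::XPath.match(doc, '//a[@class="l"]').map { |a| a.attributes['href'] }
puts hrefs
```

The query is the same one used with each_element in the listing above; only the way the document is obtained differs.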

Hpricot

Hpricot is “a Fast, Enjoyable HTML Parser for Ruby” by one of the coolest (Ruby) programmers of our century, why the lucky stiff. From my experience, the tag line is absolutely correct: Hpricot is both very fast (thanks to a C based scanner implementation) and really fun to use.
It is based on HTree and jQuery, thus it can provide the same functionality as the previous HTree + REXML combination, but with much better performance and greater ease of use. Let’s see the Google example again. I guess you will understand instantly what I mean!

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(URI.open('http://www.google.com/search?q=ruby'))
links = doc/"//a[@class=l]"
links.each { |link| puts link.attributes['href'] }

Well, though this was slightly easier than with the tools seen so far, this example does not really show the power of Hpricot. There is much, much, much more in store: different kinds of parsing, CSS selectors and searches, nearly full XPath support, and lots of chunky bacon! If you are doing something smaller and don’t need the power of scRUBYt!, my advice is to definitely use Hpricot out of the tools listed here. For more information, installation instructions, tutorials and documentation, check out Hpricot’s homepage!

RubyfulSoup

RubyfulSoup is a very powerful Ruby
screen-scraping package which offers
possibilities similar to HTree + XPath. For people who are not handy with XML/XPath,
RubyfulSoup may be a wise compromise: it’s an all-in-one, effective HTML parsing
and web scraping tool with Ruby-like syntax. Although its expressive power
lags behind XPath 2.0, it should be adequate in 90% of the cases. If your problem is in the
remaining 10%, you probably don’t need to read this tutorial anyway ;-)
Installation instructions can be found here: [5].

The Google example again:

1  require 'rubygems'
2  require 'rubyful_soup'
3  require 'open-uri'

4  url = "http://www.google.com/search?q=ruby"
5  URI.open(url) do |page|
6    page_content = page.read
7    soup = BeautifulSoup.new(page_content)
8    result = soup.find_all('a', :attrs => {'class' => 'l'})
9    result.each { |tag| puts tag['href'] }
10 end

As you can see, the difference between the HTree + REXML and RubyfulSoup examples is minimal:
basically it is limited to differences in the querying syntax. On line 8, you look up all the
<a> tags with the specified attribute list (in this case a hash with a single pair, { ‘class’ => ‘l’ }).
The other syntactical difference is looking up the value of the ‘href’ attribute on line 9.

I have found RubyfulSoup the ideal tool for screen scraping from a single page. However, web navigation
(GET/POST, authentication, following links) is not really possible, or obscure at best, with
this tool (which is perfectly OK, since it does not aim to provide this functionality). But there
is nothing to fear: the next package does exactly that.

WWW::Mechanize

As of today, the vast majority of data resides in the deep Web: databases that
are accessible only by querying through web forms. For example, if you would like to get information
on flights from New York to Chicago, you will (hopefully) not search for it on Google.
You go to the website of Ruby Airlines instead, fill in the appropriate fields and click on search.
The information which appears is not available on a static page. It is looked up on demand and
generated on the fly, so until the very moment the web server generates it for you, it is practically
non-existent (i.e. it resides in the deep Web) and hence impossible to extract. At this point
WWW::Mechanize comes into play.
(See [6] for installation instructions.)

WWW::Mechanize belongs to the family of screen scraping products (along with http-access2 and Watir)
that are capable of driving a browser. Let’s apply the ‘Show, don’t tell’ mantra, for everybody’s delight
and surprise illustrated on our Google scenario:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://www.google.com')

search_form = page.forms.with.name("f").first
search_form.fields.name("q").first.value = "ruby"
search_results = agent.submit(search_form)
search_results.links.each { |link| puts link.href if link.class_name == "l" }

I have to admit that I have been cheating with this one ;-). I had to hack WWW::Mechanize to
access a custom attribute (in this case ‘class’) because normally this is not available.
See how I did it here: [7]

This example illustrates a major difference between RubyfulSoup and Mechanize: in addition to its screen scraping
functionality, WWW::Mechanize is able to drive the web browser like a human user. It filled in the
search form and clicked the ‘search’ button, navigating to the result page, then performed screen scraping
on the results.
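
Under the hood, submitting a GET form boils down to requesting the form’s action URL with the field values encoded in the query string. The following is a rough sketch of that step using only Ruby’s standard library; form_get_url is a hypothetical helper introduced here, not part of Mechanize, and it ignores everything else a real agent handles (cookies, hidden fields, POST bodies, redirects).

```ruby
require 'uri'

# Builds the URL a browser would request when a GET form with the
# given action and field values is submitted.
def form_get_url(action, fields)
  uri = URI(action)
  uri.query = URI.encode_www_form(fields)
  uri.to_s
end

search_url = form_get_url('http://www.google.com/search', 'q' => 'ruby')
puts search_url
```

For the search form used in this article, the result is the same URL that the earlier examples fetched directly.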

This example also points out that RubyfulSoup, although lacking navigation capabilities,
is much more powerful at screen scraping. For example, as of now, you cannot extract arbitrary (say <p>)
tags with Mechanize, and as the example illustrated, attribute extraction is not possible either, not to
mention more complex, XPath-like queries (e.g. the third <td> in the second <tr>), which are easy with
RubyfulSoup/REXML. My recommendation is to combine these tools, as pointed out in the last section of this article.
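
As an illustration of such a positional query, here is a minimal REXML sketch that picks the third <td> in the second <tr>. The table is made-up sample data, assumed to be well-formed already.

```ruby
require 'rexml/document'

# Hypothetical sample table; a real one would come from a tidied page.
table = <<~XML
  <table>
    <tr><td>Item</td><td>Stock</td><td>Price</td></tr>
    <tr><td>Monitor</td><td>4</td><td>$750.00</td></tr>
  </table>
XML

doc = REXML::Document.new(table)

# XPath positions are 1-based: second row, third cell.
cell = REXML::XPath.first(doc, '//tr[2]/td[3]')
puts cell.text
```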

scRUBYt!

scRUBYt! is a simple to learn and use, yet very powerful web extraction framework written in Ruby, based on Hpricot and Mechanize. Well, yeah, I made it :-) so this is kind of self-promotion, but I think (hopefully not just because I am overly biased ;-) ) it is the most powerful web extraction toolkit available to date. scRUBYt! can navigate through the Web (clicking links, filling in text fields, crawling to further pages, thanks to Mechanize) and extract, query, transform and save the relevant data from the Web page of your interest through a concise and easy to use DSL (thanks to Hpricot and lots of smart heuristics).

OK, enough talking, let’s see it in action! I guess this is rather annoying now for the 6th time, but let’s revisit the Google example once more! (For the last time, I promise :-)

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch          'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  result 'Ruby Programming Language' do
    link 'href', :type => :attribute
  end
end

google_data.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(google_data)

Output:

  <root>
    <result>
      <link>http://www.ruby-lang.org/</link>
    </result>
    <result>
      <link>http://www.ruby-lang.org/en/20020101.html</link>
    </result>
    <result>
      <link>http://en.wikipedia.org/wiki/Ruby_programming_language</link>
    </result>
    <result>
      <link>http://en.wikipedia.org/wiki/Ruby</link>
    </result>
    <result>
      <link>http://www.rubyonrails.org/</link>
    </result>
    <result>
      <link>http://www.rubycentral.com/</link>
    </result>
    <result>
      <link>http://www.rubycentral.com/book/</link>
    </result>
    <result>
      <link>http://www.w3.org/TR/ruby/</link>
    </result>
    <result>
      <link>http://poignantguide.net/</link>
    </result>
    <result>
      <link>http://www.zenspider.com/Languages/Ruby/QuickRef.html</link>
    </result>
  </root>

    result extracted 10 instances.
        link extracted 10 instances.

You can download this example from here.

Though the code snippet is not really shorter, maybe even longer than the other ones, there are lots of things to note here. First of all, instead of loading the page directly (you can do that as well, of course), scRUBYt! allows you to navigate there by going to Google, filling in the appropriate text field and submitting the search. The next interesting thing is that you need no XPaths or other mechanism to query your data: you just copy’n’paste some examples from the page, and that’s it. Also, the whole description of the scraping process is more human-friendly. You do not need to care about URLs, HTML, passing the document around or handling the result; everything is hidden from you and controlled by scRUBYt!’s DSL instead. You even get nice statistics on how much stuff was extracted. :-)

The above example is just the tip of the iceberg: there is much, much, much more in scRUBYt! than what you have seen so far. If you would like to know more, check out the tutorials and other goodies on scRUBYt!’s homepage.

WATIR

From the WATIR page:

WATIR stands for “Web Application Testing in Ruby”. Watir drives the Internet Explorer browser the same
way people do. It clicks links, fills in forms, presses buttons. Watir also checks results, such as whether
expected text appears on the page.

Unfortunately I have no experience with WATIR since I am a Linux-only nerd, using Windows for occasional
gaming but not for development, so I cannot say anything about it first hand. Judging from the
mailing list contributions, though, I think Watir is more mature and feature-rich than Mechanize. Definitely
check it out if you are running on Win32.

The silver bullet

For a complex scenario, usually an amalgam of the above tools provides the ultimate solution:
the combination of WWW::Mechanize or WATIR (for automating site navigation), RubyfulSoup (for serious screen
scraping, where the above two are not enough) and HTree+REXML (for extreme cases where even RubyfulSoup
can’t help you).

I have been creating industry-strength, robust and effective screen scraping solutions for the last five years
of my career, and I can show you a handful of pages where even the most sophisticated solutions do not work (and
I am not talking about scraping with RubyfulSoup here, but about even more powerful solutions, like embedding
Mozilla in your application and directly accessing the DOM). So the basic rule is: there is no
spoon (err… silver bullet). I also know from experience that the number of ‘hard-to-scrape’ sites is rising
(partially because of Web 2.0 stuff like AJAX, but also because some people do not want their sites to
be extracted and apply various anti-scraping masquerading techniques).

The described tools should be enough to get you started – additionally, you may have to figure out how to
drill down to your stuff on the concrete page of interest.

In the next installment of this series, I will create a mashup application using the introduced tools, from some
more interesting data than Google ;-)
The results will be presented on a Ruby on Rails powered page, in a sortable AJAX table.











[1] There are a lot of other issues (social aspects, interoperability, design principles
etc.), but these fall outside the scope of the current topic.


[2] However, if the problem can be tackled relatively easily with regular expressions, it’s usually good
to use them, for several reasons: no additional packages are needed (this is even more important if you don’t have
install rights), you don’t have to rely on the HTML parser’s output, and if regular expressions are up to the job, they are
usually the easier way to do it.


[3] Install HTree:
wget http://cvs.m17n.org/viewcvs/ruby/htree.tar.gz (or download it from your browser)
tar -xzvf htree.tar.gz
sudo ruby install.rb


[4] There are plenty of other (possibly smarter) ways to do this, for example using
each_element_with_attribute, or a different, more effective XPath. I have chosen to use
this method to get as close to the regexp example as possible, so it is easy to observe
the difference between the two approaches for the same solution. For a real REXML tutorial/documentation
visit the REXML site.


[5] The easiest way is to install rubyful_soup from a gem:
sudo gem install rubyful_soup
Since it was installed as a gem, don’t forget to require ‘rubygems’ before requiring rubyful_soup.


[6] sudo gem install mechanize


[7] I have added two lines to WWW::Mechanize source file page_elements.rb:

To the class definition:

attr_reader :class_name

Into the constructor:

@class_name = node.attributes['class']
