Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1

This article is a follow-up to the quite popular first part on web scraping – well, sort of. The relation is closer to that between Star Wars I and IV: in chronological order, the fourth comes first. To continue the analogy, I am probably in the same shoes George Lucas was in after creating the original trilogy: the series became immensely popular and there was demand for more – in both quantity and depth.

After I realized – not least through the success of the first article – that there is a need for this sort of material, I began to work on the second part. As stated at the end of the previous installment, I wanted to create a demo web scraping application to show some advanced concepts. However, I left out a major variable from my future-plan equation: the power of Ruby.

Basically, this web scraping code was my first serious Ruby program: I had come to know Ruby just a few weeks earlier and decided to try it out on a real-life problem. After hacking on the app for a few weeks, a reusable web scraping toolkit – scRUBYt! – suddenly began to materialize, which changed the plan completely: instead of writing a follow-up, I decided to finish the toolkit, sketch the big picture of the topic, place scRUBYt! inside that frame and use it to illustrate the theory described here.

The Big Picture: Web Information Acquisition

The whole art of systematically getting information from the Web is called ‘web information acquisition’ in the literature. The process consists of four steps, executed in this order: Information Retrieval (IR), Information Extraction (IE), Information Integration (II) and Information Delivery (ID).

Information Retrieval

Navigate to and download the input documents which are the subject of the next steps. This is probably the most intuitive step – clearly, the information acquisition system has to be pointed to the document containing the data before it can perform the actual extraction.

The absolute majority of the information on the Web resides in the so-called deep web – backend databases and various legacy data stores whose contents are not available as static web documents. This data is accessible only via interaction with the web pages that serve as a frontend to these databases – by filling in and submitting forms, clicking links, stepping through wizards and so on. A typical example is an airport web page: the airport has the schedules of all the flights it serves in its databases, yet you can access this information only on the fly, by submitting a form containing your concrete request.

The opposite of the deep web is the surface web – static pages with a ‘constant’ URL, like the very page you are reading. In such a case, the information retrieval step consists of just downloading the URL. Not a really tough task.

However, as I said two paragraphs earlier, most of the information is stored in the deep web – various actions, like filling in input fields, setting checkboxes and radio buttons, clicking links and so on, are needed to get to the actual page of interest, which can then be downloaded as the result of the navigation.

Besides the fact that this is not trivial to do automatically from a programming language, simply because of the nature of the task, there are a lot of pitfalls along the way, stemming from the fact that the HTTP protocol is stateless: the information provided with one request is lost by the time the next request is made. To remedy this problem, sessions, cookies, authorization, navigation history and other mechanisms were introduced – so a decent information retrieval module has to take care of these as well.
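
To get a feel for how much bookkeeping this involves, here is a minimal sketch of carrying a session cookie over by hand with Ruby’s standard Net::HTTP library (the URL, form fields and paths are made up for illustration):

  require 'net/http'
  require 'uri'

  # every request starts from scratch, so the session cookie returned by the
  # login step has to be carried over to the next request by hand
  login_uri = URI.parse('http://www.example.com/login')
  response  = Net::HTTP.post_form(login_uri, 'user' => 'me', 'password' => 'secret')
  cookie    = response['Set-Cookie']

  Net::HTTP.start(login_uri.host, login_uri.port) do |http|
    # without the Cookie header the server would treat us as a stranger again
    account_page = http.get('/account', 'Cookie' => cookie)
    puts account_page.body
  end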

Fortunately, there are Ruby packages offering exactly this functionality. Probably the best known is WWW::Mechanize, which is able to automatically navigate through web pages as a result of interaction (filling in forms etc.) while keeping cookies, automatically following redirects and simulating everything else a real user (or rather their browser) would do. Mechanize is awesome – from my perspective it has one major flaw: you cannot interact with JavaScript-driven websites. Hopefully this feature will be added soon.
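
Just to give an impression of what driving Mechanize looks like, here is a minimal sketch (the site, the form field name and the link text are hypothetical – check the actual page before trying this):

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  page  = agent.get('http://www.example.com/search')

  # fill in and submit the first form on the page
  form      = page.forms.first
  form['q'] = 'ruby'
  results   = agent.submit(form)

  # follow a link by its text -- cookies and redirects are handled for us
  books_page = agent.click(results.links.find { |link| link.text == 'Books' })
  puts books_page.title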

Until that happy day, if you want to navigate through JavaScript-powered pages, there is a solution: (Fire)Watir. Watir is capable of doing similar things to Mechanize (I have never done a head-to-head comparison, though it would be interesting) with the added benefit of JavaScript handling, since it drives a real browser.
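
A roughly equivalent FireWatir sketch might look like this (again, the identifiers are hypothetical, and FireWatir needs a Firefox with the JSSh extension installed):

  require 'rubygems'
  require 'firewatir'   # or 'watir' for the IE-based flavour

  browser = FireWatir::Firefox.new
  browser.goto('http://www.example.com/search')

  # interact with the page just like a user would -- JavaScript included
  browser.text_field(:name, 'q').set('ruby')
  browser.button(:value, 'Search').click
  browser.link(:text, 'Books').click

  browser.close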

scRUBYt! comes with a navigation module, which is built upon Mechanize. In future releases I am planning to add FireWatir as well (precisely because of the JavaScript issue). scRUBYt! is basically a DSL for web scraping with a lot of heavy lifting going on behind the scenes. Though the real power lies in the extraction module, there are some goodies in the navigation module, too. Let’s see an example!

Goal: Go to amazon.com. Type ‘Ruby’ into the search text field. To narrow down the results, click ‘Books’, then, to narrow them further, ‘Computers & Internet’ in the left sidebar.

Realization:

  fetch           'http://www.amazon.com/'
  fill_textfield  'field-keywords', 'ruby'
  submit
  click_link      'Books'
  click_link      'Computers & Internet'

Result: the document reached by the above steps – Amazon’s narrowed-down result page.

As you can see, scRUBYt!’s DSL hides all the implementation details, making the description of the navigation as simple as possible. The result of the above few lines is a document – which is automatically fed into the scraping module, but that is already the topic of the next section.

Information Extraction

I think there is no need to explain why one would want to extract information from the Web today – the ‘how’ is a much more interesting question.

Why is web extraction such a tedious task? Because the data of interest is stored in HTML documents (after navigating to them, that is), mixed with other stuff like formatting elements, scripts and comments. Because the data lacks any semantic description, a machine has no idea what a web shop record is or what a news article looks like – it just perceives the whole document as a soup of tags and text.

Querying objects in systems which are formally defined, and thus understandable to a machine, is easy. For instance, if I want to get the first element of an array in Ruby, I can do it like this:

my_array.first

Another example of a machine-queryable structure is an SQL table: to pull out the elements matching given criteria, all that needs to be done is to execute an SQL query like this:

SELECT name FROM students WHERE age > 25

Now, try to do similar queries against a web page. For example, suppose you have already navigated to an eBay result page by searching for the term ‘notebook’. Say you would like to execute the following query: ‘give me all the records with a price lower than $400’ (and get the results into a data structure, of course – not rendered in your browser, since that works without any problems anyway).

The query was definitely an easy one, yet without implementing a custom script that extracts the needed information and saves it into a data structure (or using something like scRUBYt!, which does exactly this for you) you have no chance of getting this information out of the source code.

There are ongoing efforts to change this situation – most notably the Semantic Web, common ontologies, and different Web 2.0 techniques like taxonomies, folksonomies, microformats and tagging. The goal of these techniques is to make documents understandable to machines and so eliminate the problems stated above. While there are already some promising results in this area, there is a long way to go until the whole Web becomes such a friendly place – my guess is that this will happen around Web 88.0 in the optimistic case.

However, at the moment we are only at version 2.0 (at most), so if we would like to scrape a web page for whatever reason *today*, we need to cope with the difficulties we are facing. I wrote an overview of how to do this with the tools available in Ruby (update: there is a new kid on the block – Hpricot – which is not mentioned there).

The rough idea behind those packages is to parse the web page source into some meaningful structure (usually a tree), then provide a querying mechanism (such as XPath, CSS selectors or some other tree navigation model). You could think now: ‘A-ha! So a web page *can* actually be turned into something meaningful for machines, and there *is* a formal model to query this structure – so where is the problem described in the previous paragraphs? You just write queries like you would in the case of a database, evaluate them against the tree, and you are done.’
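
For illustration, a minimal sketch of this parse-then-query approach with Hpricot could look like this (the URL, class names and markup are of course hypothetical):

  require 'rubygems'
  require 'hpricot'
  require 'open-uri'

  # parse the page into a tree...
  doc = Hpricot(open('http://www.example.com/search?q=notebook'))

  # ...and query the tree -- Hpricot understands CSS selectors and a subset of XPath
  (doc/"table.results tr").each do |row|
    name  = row.at("td.name")
    price = row.at("td.price")
    puts "#{name.inner_text}: #{price.inner_text}" if name && price
  end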

The problem is that the machine’s understanding of the page and the way a human thinks about querying the information are entirely different, and there is no formal model (yet) to bridge this discrepancy. A human wants to scrape ‘web shop records of Canon cameras with a maximal price of $1000’; the machine sees this as ‘the third <td> tag inside the eighth <tr> tag inside the fifth <table> … (lots of other tags) inside the <body> tag inside the <html> tag, where the text of the seventh <td> tag contains the string “Canon” and the text of the ninth <td> is not bigger than 1000’ – and to even get the value 1000, you have to use a regular expression or something similar to strip the currency symbol and whatever other information is most probably there.
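
To make the discrepancy tangible, here is roughly what such a hand-written query might look like with Hpricot – the table and cell indices are made up and tied to one particular (hypothetical) page layout:

  require 'rubygems'
  require 'hpricot'
  require 'open-uri'

  doc    = Hpricot(open('http://www.example.com/search?q=camera'))
  tables = doc/"table"
  rows   = tables[4] ? (tables[4]/"tr") : []       # 'the fifth table inside the body...'

  rows.each do |row|
    cells = row/"td"
    next unless cells[6] && cells[8]               # skip rows that are not records at all
    next unless cells[6].inner_text =~ /Canon/     # 'the seventh td contains Canon'
    # strip the currency symbol and separators before the price can be compared
    price = cells[8].inner_text[/[\d.,]+/].to_s.delete(',').to_f
    puts row.inner_text.strip if price > 0 && price <= 1000
  end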

So why is this so easy with a database? Because the data stored there has a formal model (specified by the CREATE TABLE statement). Both you and the computer know *exactly* what a Student or a Camera looks like, and both of you speak the same language (most probably an SQL dialect).

This is totally different in the case of a web page. A web shop record, a camera detail page or a news item can look like just about anything, and your only chance of finding out for the concrete page of interest is to exploit its structure. This is a tedious task on its own (as I said earlier, a web page is a mess of real data, formatting, scripts, stylesheet information…). Moreover, there are further problems: for example, web shop records need not even be uniform within the same page – certain records may lack cells that others have, some may hold part of the information on a detail page while others do not, and vice versa – so in some cases identifying a data model is very complicated or outright impossible. And I have not even talked about scraping the records yet!

So what could be the solution?

Intuitively, what is needed is an interpreter which understands the human query and translates it into XPath expressions (or whatever querying mechanism the machine understands). This is more or less what scRUBYt! does. Let me explain how – it will be easiest through a concrete example.

Suppose you would like to monitor stock information on finance.yahoo.com. This is how I would do it with scRUBYt!:

  # Navigate to the page
  fetch 'http://finance.yahoo.com/'

  # Grab the data!
  stockinfo do
    symbol 'Dow'
    value  '31.16'
  end

Output:

  <root>
    <stockinfo>
      <symbol>Dow</symbol>
      <value>31.16</value>
    </stockinfo>
    <stockinfo>
      <symbol>Nasdaq</symbol>
      <value>4.95</value>
    </stockinfo>
    <stockinfo>
      <symbol>S&P 500</symbol>
      <value>2.89</value>
    </stockinfo>
    <stockinfo>
      <symbol>10-Yr Bond</symbol>
      <value>0.0100</value>
    </stockinfo>
  </root>

Explanation: I think the navigation step does not require any further explanation – we fetched the page of interest and fed it into the scraping module.

The scraping part is more interesting at the moment. Two things happened here: we defined a hierarchical structure for the output data (much like we would define an object – we are scraping StockInfos, which have Symbol and Value fields, or children), and we showed scRUBYt! what to look for on the page in order to fill the defined structure with relevant data.

How did I know I had to specify ‘Dow’ and ‘31.16’ to get these nice results? By manually pointing my browser at ‘http://finance.yahoo.com/’ and copying an example of the stuff I wanted to scrape – leaving the rest to scRUBYt!. What actually happens under the hood is that scRUBYt! finds the XPaths of these examples, figures out how to extract the similar ones and arranges the data nicely into a result XML document (well, there is much more going on, but this is the rough idea). If anyone is interested, I can explain this in a further post.
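
Just to convey the rough idea right away (this is not scRUBYt!’s actual code, and it assumes well-formed markup that REXML can parse), generalizing the XPath of an example could be sketched like this:

  require 'rexml/document'

  # Find the element containing the example text and derive a generalized
  # XPath that matches its siblings as well -- a crude illustration only.
  def generalized_xpath(html, example)
    doc  = REXML::Document.new(html)
    node = REXML::XPath.match(doc, '//*').find do |el|
      el.text && el.text.strip == example
    end
    return nil unless node
    node.xpath.gsub(/\[\d+\]/, '')   # e.g. /html/body/table/tr[1]/td[1] -> /html/body/table/tr/td
  end

  html = '<html><body><table>' +
         '<tr><td>Dow</td><td>31.16</td></tr>' +
         '<tr><td>Nasdaq</td><td>4.95</td></tr>' +
         '</table></body></html>'

  puts generalized_xpath(html, 'Dow')   # => /html/body/table/tr/td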

You could think now: ‘OK, this is very nice and all, but you have been talking about *monitoring* and I don’t really see how – the value 31.16 will change sooner or later, and then you have to go back to the page and re-specify the example – I would not call this monitoring’.

Great observation. It’s true that scRUBYt! would not be of much use if changing examples were not handled (unless you want to get the data only once, that is) – fortunately, the situation is dealt with in a powerful way!

Once you have run the extractor and you think the data it scrapes is correct, you can export it. Let’s see what the exported finance.yahoo.com extractor looks like:

  # Navigate to the page
  fetch 'http://finance.yahoo.com/'

  # Construct the wrapper
  stockinfo "/html/body/div/div/div/div/div/div/table/tbody/tr" do
    symbol "/td[1]/a[1]"
    value  "/td[3]/span[1]/b[1]"
  end

As you can see, there are no concrete examples any more – the system has generalized the information, and now you can use this extractor to scrape the data automatically whenever you like – until the moment the folks at Yahoo change the structure of the page, which fortunately does not happen every other day. In that case the extractor has to be regenerated with up-to-date examples (in the future I am planning to add automatic regeneration for such cases) and the fun can begin from the start once again.

This example has just scratched the surface of what scRUBYt! is capable of – there are tons of advanced features for fine-tuning the scraping process and getting exactly the data you need. If you are interested, check out http://scrubyt.org for more information!

Conclusion

The first two steps of information acquisition (retrieval and extraction) deal with the question ‘how do I get the data I am interested in?’ (querying). As of the present version (0.2.0), scRUBYt! covers just these two steps – however, to do even these properly, I will need a lot of testing, feedback, bug fixing and stabilization, plus heaps of new features and enhancements – because, as you have seen, web scraping is not a straightforward thing to do at all.

The last two steps (integration and delivery) address the question ‘what do I do with the data once it is collected, and how?’ (orchestration). These facets will be covered in a future installment – most probably once scRUBYt! contains these features as well.

If you liked this article and you are interested in web scraping in practice, be sure to install scRUBYt! and check out the community page for further instructions – the site is just taking off, so there is not too much there yet, but hopefully enough to get you started. I am counting on your feedback, suggestions, bug reports, the extractors you have created and so on to enhance both scrubyt.org and the scRUBYt! user experience in general. Be sure to share your experiences and opinions!


40 thoughts on “Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1”

  1. Yeah, I have been lamenting a lot over this myself when I wrote the article… Then I voted for 5 for illustration purposes – however, it seems it was not a very good idea 🙂 Anyway, I have solved it with a Solomonic decision I guess 🙂

  2. Pingback: scRUBYt - Hot, New Ruby Web-Scraping Toolkit Released

  3. Pingback: Elliott’s blog » Blog Archive » scRUBYt - Hot, New Ruby Web-Scraping Toolkit Released

  4. How would you use scrubyit to go to a billing site (let’s say cell phone bill) and download pdf statement automatically?

  5. El: It depends on the concrete site very much, from this info it is impossible to tell that…

  6. Pingback: links for 2007-02-08 « Donghai Ma

  7. This will be available from the next release on. The syntax will be:

    fetch 'http://www.google.com', :proxy => 'my.cool.pro.xy'
    

    Please subscribe to the feed at scrubyt.org if you would like to be mailed about new releases, tutorials and any information in general.

  8. Pingback: Innovatives Web-Grabbing Modul at Lost in Programming

  9. how cool this is my friend
    the truth is you are making screen scraping accessible so easy and convenient to anyone that it is difficult not to use it.. i for once i am using it to make ruby check the status of some ip phones with web interface

    thanks

  10. As you mentioned, it is possible to scrape data from tables on a website. Could you help explain it a bit more to me, as i have to scrape a 5 column table from a website and import the data into a mysql table.

    I really appreciate it, good work you are doing here.

    Thanks

  11. Pingback: Shoob » Blog Archive » links for 2007-03-25

  12. I’ve translated “Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1” into Japanese.
    If you want to delete the translation, tell me so. I’ll follow you.

    http://playet.jugem.jp/?eid=12

    Because I am a newbie both in programming(not only Ruby, but Any launguages) and English, the translation could be lost something you would like to say…

    Anyone recognize Japanese, correct my translation if necessary.

  13. The site looks great ! Thanks for all your help ( past, present and future !)

  14. I know this is a little late, El, but there’s an app I use to do screen scraping and data extraction called screen-scraper. It handles going through forms, picking out data, downloading files, etc. and has a nice scripting engine that works with most popular languages. Can even be programmed with regular expressions :).

    Great article, Peter, scRUBYt! looks awesome.

  15. Thanks alan!

    Well, actually scRUBYt! knows all those things you have stated above (as for the ‘scripting engine that works with most popular languages’ – scRUBYt! is ‘just’ pure Ruby). I am neither trying to compete with screen-scraper (or other tools) nor to replicate or try to resemble to their feature set, so I do not mean to start a scraper war here…

    I know at least one guy (actually a gal) who used screen-scraper and switched to scRUBYt!, and AFAIK she is happy so far. On the other hand, she uses a Rails frontend to present the scraped data, so scRUBYt! was a natural fit for her. I guess if you are using Java tools, screen-scraper may be better.

    Bottom line: use the right tool for the right job! scRUBYt! is based on a different philosophy, has different features and is currently kind of immature – on the other hand it has the best community! 🙂

  16. you write “Once you run the extractor and you think the data it scrapes is correct, you can export it. Let’s see how the exported finances.yahoo.com extractor looks like:”

    is there a command for exporting this or i have to make this step by hand to find out the xpath?

  17. I am trying to get scrubyt to work with the Ruby on Rails framework. I saw some talk on the forums (that took place last year), but not any clear answers. And I haven’t seen anything about it posted lately. Do you know if anyone has it working with RoR? If so, could you point me in the right direction or in contact with someone that has it working. Thank you.

  18. Where is the installation file, I have been trying to find it with no success. I click on all the links and I don’t see any installation/download link. Is this is ruby gem ?

  19. Pingback: Recent Links Tagged With "scrubyt" - JabberTags
