Short List of the Greatest Inventions of all Time

1. Sliced bread (1928, Otto Frederick Rohwedder)
2. Ruby (1995, Yukihiro “Matz” Matsumoto)
3. HPricot (2006, Why the lucky stiff)
4. Ruby on Rails (2005, David Heinemeier Hansson)

What a relief! I *really* had the urge to tell everybody how cool HPricot is; I just did not know how – until now. The cosmic balance is somewhat restored now that I have blurted out this post :-).

Needless to say, you need to take this list with a tiny droplet of humor: of course, if we consider development time, the number and scope of the solutions offered, innovation, community, book coverage etc., then Rails is a clear winner (and anyway, the two players are not in the same league). However, HPricot is a great example of how a not-new-at-all thing can be made much more usable, fast and “heaps of fun to use” (really) just by clever design and use of the right tools (and a dash of a cool programmer’s charisma). It is one thing to come up with a purple cow in an unsaturated market with lots of room for innovation, and a different story to do the same when everything has already been said and done. And _why did it. Again.

I am writing a (not so) small web extraction framework in Ruby (planned release: Xmas 2006) which relies heavily on HPricot as the HTML query language – so I dare say I know (at least some parts of) HPricot pretty well, yet it still keeps me totally amazed. What I like most about it (besides the fact that it is lightning-on-steroids fast compared to anything else available for the same task, feature rich, reliable, stable etc.) is that it takes the ‘principle of least surprise’ to the next level: I would call it the ‘principle of almost no surprise’. Anyone who has a bit of knowledge of org.w3c.dom, XML, XPath or XSLT, or has experience with other HTML/XML parsers and tools, will rarely have to refer to the documentation (of course there is a period of learning the basics and soaking up the HPricot philosophy, but the learning curve is *really* gentle).

Before I get to the proof that HPricot is able to solve the food problems in Africa or something, I need to cool down a bit :-): HPricot is not for everyone and not for every problem. If you need complex XPath evaluation, for instance, you will have to stick with good old REXML (for now, at least – I read that _why will add more XPath support and other goodies in the future). In the present version, you won’t be able to evaluate things like axes (e.g. ancestor::html), XPath functions (e.g. normalize-space) or even XPaths with indices (like html/body/table[1]/tr[2]/td[5]), though I wrote a small script to remedy this problem temporarily.
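To make the limitation concrete, here is a minimal sketch of the kind of query that still calls for REXML: an XPath with indices, plus an axis step. The sample page, the constant `SAMPLE_PAGE` and the helper `cell_texts` are made up for illustration, and the markup is deliberately well-formed, since REXML is an XML parser and will not cope with real-world tag soup.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample page (an assumption for this sketch).
SAMPLE_PAGE = <<~HTML
  <html>
    <body>
      <table>
        <tr><td>a1</td><td>a2</td><td>a3</td></tr>
        <tr><td>b1</td><td>  b2  </td><td>b3</td></tr>
      </table>
    </body>
  </html>
HTML

# Hypothetical helper: the stripped text of every <td>, in document order.
def cell_texts(doc)
  REXML::XPath.match(doc, '//td').map { |td| td.text.strip }
end

doc = REXML::Document.new(SAMPLE_PAGE)

# XPath with indices: third cell of the second row of the first table.
cell = REXML::XPath.first(doc, '/html/body/table[1]/tr[2]/td[3]')

# Axis step: walk back up from that cell to its enclosing <table>.
table = REXML::XPath.first(cell, 'ancestor::table')

puts cell.text
puts table.name
puts cell_texts(doc).join(', ')
```

On pages that are not well-formed, `REXML::Document.new` will most likely raise a parse error, which is exactly where a forgiving HTML parser earns its keep.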

There are lots of HTML-extraction related questions on the Ruby mailing list (like how to extract every table cell from a <table> etc.). My advice is to always check out HPricot first: sometimes it can be overkill (if you can get what you want with a simple regexp, for example), but usually it is the right tool to parse and query even the ugliest HTML pages out there – unless you need heavy XPath/XQuery machinery, which is rarely the case in real life.
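For the “simple regexp” end of that spectrum, a rough sketch in plain Ruby might look like the following. The markup in `SIMPLE_TABLE` and the helper name `table_cells` are invented for this example, and the pattern assumes flat, non-nested cells; anything messier is where a real parser takes over.

```ruby
# Hypothetical snippet of simple, non-nested table markup.
SIMPLE_TABLE = '<table><tr><td>apple</td><td class="n"> 3 </td></tr>' \
               '<tr><td>pear</td><td>5</td></tr></table>'

# Returns the raw contents of every <td> as an array of strings.
# Assumes cells are not nested and each one has an explicit closing tag.
def table_cells(html)
  html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).flatten.map(&:strip)
end

puts table_cells(SIMPLE_TABLE).join(', ')
```

The non-greedy `(.*?)` keeps each match inside a single cell, and the `m` flag lets a cell span several lines. Missing closing tags or nested tables will quietly produce wrong results.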

What else do I need to add? Great job, _why. Thanks man.

35 thoughts on “Short List of the Greatest Inventions of all Time”

  1. Hpricot is truly lovely. No more handcrafted web scrapers for me – I feel like a professional web scraper with Hpricot. I feel like “this website was meant to give up its data” with Hpricot. I feel I don’t need microformats with Hpricot. Amen.

  2. Yeah, just by looking at the current web scraping arsenal available in Ruby, I have to second your opinion (well, with a small addition: FireWatir/Mechanize can come in handy if I need to automate some steps during the scraping).

    However, this will (hopefully) change greatly when I release my web extraction framework ‘scRUBYt!’ (shameless self-promotion 🙂 but hey, this is my blog…). It’s built on HPricot and Mechanize, but extended with a LOT of powerful features – I am in my fifth year at a web extraction company, so I (hopefully :-)) have some ideas about what this should look like.

    But until then, surely HPricot is the king – and let’s see after the first release of scRUBYt!…

  3. Nice post… I really must find the time to get in amongst HPricot – I have some fun projects which I sort of gave up on because I couldn’t bear writing the scraper bit of them. Sounds like scRUBYt might be a good fit too! Look forwards to hearing more on that one too.

  4. Well, scRUBYt! development is at full steam, so stay tuned! Just a small example (this is already possible in the present version):

    Task: Turn an HTML table into a comma-separated list.

    scRUBYt! in action:

    table_data = P.table do
                    P.row do
                      P.cell 'This is the first <td> in the table!'
                    end
                 end
    
    table_data.to_csv          # we are done!
    table_data.to_xml          # if we want XML
    table_data.row[2].cell[3]  # this gives us the 3rd <td> in the 2nd <tr>
    

    This is a very primitive example; I could not come up with an easier one. In practice scRUBYt! will be capable of scraping much more complicated pages (like eBay or Amazon), navigating them, transforming the output etc.

  5. A comment on the previous example: The line

    P.cell 'This is the first <td> in the table!'
    

    ‘tells’ scRUBYt! that a table cell looks like this (by copy&pasting its text content from the browser) and the other cells are automatically detected.
