  1. Sliced bread (1928, Otto Frederick Rohwedder)
  2. Ruby (1995, Yukihiro “Matz” Matsumoto)
  3. HPricot (2006, Why the lucky stiff)
  4. Ruby on Rails (2005, David Heinemeier Hansson)

What a relief! I really had the urge to tell everybody how cool HPricot is, but did not know how to go about it until now. The cosmic balance is somewhat restored now that I have blurted out this post :-).

Needless to say, you should take this list with a tiny droplet of humor. Of course, if we consider development time, the amount and scope of offered solutions, innovation, community, book coverage and so on, then Rails is the clear winner (and anyway, the two players are not in the same league). However, HPricot is a great example of how a not-at-all-new thing can be made much more usable, fast and “heaps of fun to use” (really) just by clever design and use of the right tools (and a dash of a cool programmer’s charisma). It is one thing to come up with a purple cow in a non-saturated market with lots of space for innovation, and a different story to do the same when everything has already been said and done. And _why did it. Again.

I am writing a (not so) small web extraction framework in Ruby (planned release: Christmas 2006) which relies heavily on HPricot as the HTML query language, so I dare say I know (at least some parts of) HPricot pretty well, yet it still amazes me. What I like most about it (besides being lightning-on-steroids fast compared to anything else available for the same task, feature-rich, reliable, stable and so on) is that it takes the ‘principle of least surprise’ to the next level: I would call it the ‘principle of almost no surprise’. Anyone who has a bit of knowledge of org.w3c.dom, XML, XPath or XSLT, or has experience with other HTML/XML parsers and tools, will rarely have to refer to the documentation (of course there is a period of learning the basics and soaking up the HPricot philosophy, but it is a short one: the basics are quick to pick up).

Before I get to the proof that HPricot is able to solve the food problems in Africa or something, I need to cool down a bit :-). HPricot is not for everyone and not for every problem. If you need complex XPath evaluation, for instance, you will have to stick with good old REXML (for now, at least - I read that _why will add more XPath support and other goodies in the future). In the present version you cannot evaluate axes (e.g. ancestor::html), XPath functions (e.g. normalize-space) or even XPaths with indices (like html/body/table[1]/tr[2]/td[5]), though I wrote a small script to remedy that last problem temporarily.
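To make the distinction concrete, here is a minimal sketch of the kind of query that still calls for REXML: an indexed path and an axis. It assumes well-formed markup (REXML is an XML parser and will choke on real-world tag soup); the sample table and the variable names are made up for illustration.

```ruby
require 'rexml/document'

# Tiny, well-formed sample document (invented for this example).
html = <<~HTML
  <html>
    <body>
      <table>
        <tr><td>a1</td><td>a2</td></tr>
        <tr><td>b1</td><td>b2</td></tr>
      </table>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# XPath with indices: first cell of the second row of the first table.
cell = REXML::XPath.first(doc, '/html/body/table[1]/tr[2]/td[1]')

# XPath axis: walk up from that cell to its <html> ancestor.
root = REXML::XPath.first(cell, 'ancestor::html')

puts cell.text
puts root.name
```

For ugly, non-well-formed pages this approach breaks down quickly, which is exactly where a forgiving parser like HPricot earns its keep.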

There are a lot of HTML-extraction related questions on the Ruby mailing list (like how to extract every table cell from a <table>). My advice is to always check out HPricot first. Sometimes it is overkill (if you can get what you want with a simple regexp, for example), but usually it is the right tool to parse and query even the ugliest HTML pages out there - unless you need heavy XPath/XQuery machinery, which is rarely the case in real life.
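As an illustration of the "simple regexp" case, here is a rough sketch that pulls the text of every table cell out of a small snippet using nothing but String#scan. It assumes tidy markup with no nested tables and no markup inside the cells; the sample HTML and the variable names are invented for the example.

```ruby
# Sample markup (made up); attributes on <td> are tolerated by the pattern.
html = '<table><tr><td>one</td><td class="x">two</td></tr>' \
       '<tr><td>three</td></tr></table>'

# Non-greedy capture of whatever sits between each <td ...> and </td>.
# scan returns one single-element array per match, hence the flatten.
cells = html.scan(%r{<td[^>]*>(.*?)</td>}m).flatten

puts cells.join(', ')
```

The moment cells contain nested tags, unclosed elements or other real-world mess, a pattern like this becomes fragile, and reaching for a proper parser is the saner choice.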

What else do I need to add? Great job, _why. Thanks man.




35 Responses to “Short List of the Greatest Inventions of all Time”

  1. Dr Nic Says:

    Hpricot is truly lovely. No more handcrafted web scrapers for me - I feel like a professional web scraper with Hpricot. I feel like “this website was meant to give up its data” with Hpricot. I feel I don’t need microformats with Hpricot. Amen.

  2. peter Says:

    Yeah, just by looking at the current web scraping arsenal available in Ruby, I have to second your opinion (well, with a small addition: FireWatir/Mechanize can come in handy if I need to automate some steps during the scraping).

    However, this will (hopefully) change greatly when I release my web extraction framework ‘scRUBYt!’ (shameless self-promotion :-) but hey, this is my blog…). It’s built on HPricot and Mechanize, but extended with a LOT of powerful features - I have been working for a web extraction company for five years now, so I (hopefully :-)) have some ideas about what this should look like.

    But until then, surely HPricot is the king - and let’s see after the first release of scRUBYt!…

  3. Peter Says:

    Nice post… I really must find the time to get in amongst HPricot - I have some fun projects which I sort of gave up on because I couldn’t bear writing the scraper bit of them. Sounds like scRUBYt might be a good fit too! Looking forward to hearing more on that one.

  4. peter Says:

    Well, scRUBYt! development is at full steam, so stay tuned! Just a small example (this is already possible in the present version):

    Task: Turn an HTML table into a comma-separated list.

    scRUBYt! in action:

    table_data = P.table do
                   P.row do
                     P.cell 'This is the first <td> in the table!'
                   end
                 end

    table_data.to_csv          # we are done!
    table_data.to_xml          # if we want XML
    table_data.row[2].cell[3]  # this gives us the 3rd <td> in the 2nd <tr>
    

    This is a very primitive example; I could not come up with an easier one. In practice scRUBYt! will be capable of scraping much more complicated pages (like eBay or Amazon), navigating them, transforming the output etc.

  5. peter Says:

    Comment to the previous example: The line

    P.cell 'This is the first <td> in the table!'
    

    ‘tells’ scRUBYt! that a table cell looks like this (by copy&pasting its text content from the browser) and the other cells are automatically detected.

  6. al Says:

    huinya
