1. Sliced bread (1928, Otto Frederick Rohwedder)
2. Ruby (1995, Yukihiro “Matz” Matsumoto)
3. HPricot (2006, Why the lucky stiff)
4. Ruby on Rails (2005, David Heinemeier Hansson)
Needless to say, this list should be taken with a tiny droplet of humor. Of course, if we consider development time, the amount and scope of offered solutions, innovation, community, book coverage and so on, Rails is the clear winner (and anyway, the two players are not in the same league). However, HPricot is a great example of how a not-new-at-all thing can be made much more usable, fast and “heaps of fun to use” (really) just by clever design and use of the right tools (and a dash of a cool programmer’s charisma). It is one thing to come up with a purple cow in a non-saturated market with lots of space for innovation, and a different story to do the same when everything has already been said and done. And _why did it. Again.
I am writing a (not so) small web extraction framework in Ruby (planned release: Xmas 2006) which relies heavily on HPricot as the HTML query language, so I dare say I know (at least some parts of) HPricot pretty well, yet it still keeps me totally amazed. What I like most about it (besides being lightning-on-steroids fast compared to anything else available for the same task, feature-rich, reliable, stable etc.) is that it takes the ‘principle of least surprise’ to the next level: I would call it the ‘principle of almost no surprise’. Anyone who has a bit of knowledge of org.w3c.dom, XML, XPath or XSLT, or experience with other HTML/XML parsers and tools, will have to refer to the documentation very rarely (of course there is a period of learning the basics and soaking up the HPricot philosophy, but you climb the learning curve *really* fast).
Before I get to the proof that HPricot is able to solve the food problems in Africa or something, I need to cool down a bit: HPricot is not for everyone and not for every problem. If you need complex XPath evaluation, for instance, you will have to stick with good old REXML (for now, at least; I read that _why will add more XPath support and other goodies in the future). In the present version, you won’t be able to evaluate things like axes (e.g. ancestor::html) or XPath functions (e.g. normalize-space), and not even XPaths with indices (like html/body/table/tr/td, though I wrote a small script to remedy this problem temporarily).
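To make the caveat a bit more concrete, here is a minimal sketch of the kind of query that still calls for REXML: a path with indices and an ancestor axis. The sample markup and the helper names `cell_text` and `root_name_via_ancestor` are made up for illustration, and the input has to be well-formed XML, which real-world HTML often is not.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup (REXML is an XML parser).
SAMPLE_XML = <<~XML
  <html>
    <body>
      <table>
        <tr><td>first cell</td><td>second cell</td></tr>
        <tr><td>third cell</td><td>fourth cell</td></tr>
      </table>
    </body>
  </html>
XML

# XPath with indices: text of the cell at a 1-based row and column,
# or nil if there is no such cell.
def cell_text(doc, row, col)
  cell = REXML::XPath.first(doc, "/html/body/table/tr[#{row}]/td[#{col}]")
  cell && cell.text
end

# XPath axis: walk up from any node to its enclosing <html> element
# and return that element's name.
def root_name_via_ancestor(node)
  html = REXML::XPath.first(node, 'ancestor::html')
  html && html.name
end

doc  = REXML::Document.new(SAMPLE_XML)
cell = REXML::XPath.first(doc, '/html/body/table/tr[2]/td[2]')

puts cell_text(doc, 1, 2)
puts root_name_via_ancestor(cell)
```

If your queries look like these, REXML is the safer bet for now; if they are plain element or attribute lookups, HPricot will most likely do.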
There are lots of HTML-extraction related questions on the Ruby mailing list (like how to extract every table cell from a <table> etc.). My advice is to always check out HPricot first: sometimes it can be overkill (if you can get what you want with a simple regexp, for example), but usually it is the right tool to parse and query even the ugliest HTML pages out there, unless you need heavy XPath/XQuery machinery, which is rarely the case in real life.
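For the "simple regexp" case, a minimal sketch could look like the following. The helper name `table_cells` and the sample markup are hypothetical, and the pattern assumes flat, properly closed `<td>` tags; nested tables or unclosed cells will trip it up, and that is the point where a real parser earns its keep.

```ruby
# Hypothetical helper: collects the inner text of every <td>...</td> pair.
# Only suitable for simple, flat tables with properly closed cells.
def table_cells(html)
  html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).flatten.map(&:strip)
end

sample = '<table><tr><td>apple</td><td class="x"> pear </td></tr>' \
         '<tr><td>plum</td><td>fig</td></tr></table>'

p table_cells(sample)
```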
What else do I need to add? Great job, _why. Thanks man.