Evaluating XPaths with indices in Hpricot

I am currently working on a Ruby web-extraction framework (more on this in a future post), and I have chosen Hpricot for HTML parsing and XPath evaluation. So far it seems to be the perfect choice: it is lightning fast (nothing else I have tried so far, including REXML, Htree and RubyfulSoup, came even close in terms of speed), very intuitive to use, and it simply works for nearly everything you need in day-to-day tasks.

However, since my web extractor relies heavily on XPath, I sometimes need things that are not included in the ‘nearly-everything’ bag of goodies. So far I have always managed to come up with a solution. The latest case was the evaluation of XPaths with indices, such as p[1]/b[2]. This is what I hacked up:

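# Turns an indexed path like 'doc/p[1]/b[2]/i[2]' into a Ruby expression
# string, here "(((doc/'p')[0]/'b')[1]/'i')[1]", that can be passed to eval.
def to_hpricot(xpath)
  # One opening parenthesis per index found in the path
  '(' * (xpath.split(/\d+/).size - 1) +
    # Shift XPath's 1-based indices to Ruby's 0-based ones, then quote each
    # tag name and close its parenthesis: "/p[" becomes "/'p')["
    xpath.gsub(/\d+/) { |num| (num.to_i - 1).to_s }.
          gsub(/\/(.*?)\[/) { "/'#{$1}')[" }
end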

Whew, that's ugly - but it works quite well. Let's see a simple example:

require 'hpricot'

doc = Hpricot("<p>A<b>very</b>simple<b>
  <i>small</i>test<i>document</i>.</b>Very cool.</p>")

eval(to_hpricot('doc/p[1]/b[2]/i[2]'))
# => {elem <i> {text "document"} </i>}

OK, it is really just a hack: it does not consider special cases and does not handle anything besides indices (no axes, no @ attributes, nothing else). Hopefully it is only a temporary solution, and _why will add evaluation of indexed XPaths to Hpricot soon. At least one Hpricot user would be really delighted about it :-)
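
For what it is worth, the same lookup can also be sketched without building a string for eval. The two helpers below, parse_indexed_path and evaluate_indexed_path, are hypothetical names and not part of Hpricot. The sketch only assumes that the object being walked responds to / (search) and that a search result can be indexed with [], which are the two operations the generated expression relies on. Unlike the eval version, the path here does not start with the variable name.

```ruby
# Hypothetical helper, not part of Hpricot: splits an indexed path such as
# 'p[1]/b[2]/i[2]' into [tag, zero_based_index] pairs.
def parse_indexed_path(path)
  path.split('/').reject(&:empty?).map do |step|
    m = step.match(/\A(\w+)(?:\[(\d+)\])?\z/)
    raise ArgumentError, "unsupported step: #{step.inspect}" unless m
    # XPath indices are 1-based, Ruby indices are 0-based
    [m[1], m[2] ? m[2].to_i - 1 : nil]
  end
end

# Hypothetical helper: walks the steps starting from root, using only the
# / operator and [] indexing. Returns nil as soon as an index has no match.
def evaluate_indexed_path(root, path)
  parse_indexed_path(path).inject(root) do |node, (tag, index)|
    break nil if node.nil?
    found = node / tag
    # A step without an index keeps the whole search result
    index ? found[index] : found
  end
end
```

With the doc from the example, evaluate_indexed_path(doc, 'p[1]/b[2]/i[2]') is meant to reach the same <i> element as the eval version. It shares the same limits: indices only, no axes or attributes.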

6 thoughts on “Evaluating XPaths with indices in Hpricot”

  1. Yeah, I know about scrAPI, but I am going to take a different (X)Path in both phases. In the wrapper generation phase, instead of providing concrete XPaths (= CSS selectors in scrAPI), it will be possible to define the data you are looking for with examples, so in the ideal case you won’t need to use XPath at all. In the evaluation phase, there will be lots of heuristics, and XPath instead of CSS. The primary aim is to create a wrapper generator that needs minimal input and technical knowledge (of course it will be possible to use XPaths or even Ruby if you wish), yet performs robust extractions and can be used to quickly scrape sites like Amazon, eBay etc. and integrate the data into something usable in real life.

    The goal is to create an easy-to-use wrapper generator that works in practice and makes it possible to create mashups or use the extracted data further in any RoR or Ruby (or any other) app…

  2. Funny, I was looking at Hpricot, and then at alternatives (I want to scrape my online bank statements), when I found your original, nicely detailed page on scraping written in June. I was researching all of the suggestions you made until I read to the last comment and saw that you’d settled on Hpricot anyway. ;)

    What I’d like to know is: do you still maintain that Mechanize is the best way to do the navigation, with Hpricot then doing the scraping once you have the page?

    Or does Hpricot give you some way to navigate?

    If the former, can you show us some code that combines Hpricot and Mechanize?
