December 16th, 2008
As announced on the scRUBYt! blog, there is a brand new release of scRUBYt! which, among other additions, enables AJAX scraping. I’d like to present a few examples of kicking the data out of non-trivial-to-scrape pages: LinkedIn, Google Analytics and Yahoo (which in itself is not a big deal, unless you want to scrape the suggestions that pop up after entering a keyword into the search text field).
Without further ado, let’s get down to business!
Let’s say you’d like to scrape your LinkedIn contact list - first name, last name and e-mail of every contact you have. What makes this task complicated (but not for scRUBYt!) is that the contact list is inserted with AJAX after the page is loaded into the browser, and thus it is ‘invisible’ to a standard HTML parser like Hpricot/Nokogiri, so don’t try with those. Instead, check out how you might do it with scRUBYt!:
- property_data = Scrubyt::Extractor.define :agent => :firefox do
-   fetch 'https://www.linkedin.com/secure/login'
-   fill_textfield 'session_key', '*****'
-   fill_textfield 'session_password', '*****'
-   submit
-   click_link_and_wait 'Connections', 5
-   vcard "//li[@class='vcard']" do
-     first_name "//span[@class='given-name']"
-     second_name "//span[@class='family-name']"
-     email "//a[@class='email']"
-   end
- end
- puts property_data.to_xml
The result looks like this:
- <vcard>
-   <first_name>Alex</first_name>
-   <second_name>Combas</second_name>
-   <email>*** alex's e-mail ***</email>
- </vcard>
- <vcard>
-   <first_name>Peter</first_name>
-   <second_name>Cooper</second_name>
-   <email>*** peter's e-mail ***</email>
- </vcard>
- <vcard>
-   <first_name>Jim</first_name>
-   <second_name>Cropcho</second_name>
-   <email>*** jim's e-mail ***</email>
- </vcard>
The magic happens in the click_link_and_wait line: you click the ‘Connections’ link and wait 5 seconds for the list to be loaded with AJAX. Then you can scrape the contacts as you normally would.
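A fixed 5-second wait can be too short on a slow connection and needlessly long on a fast one. As a rough sketch of the general alternative, the helper below polls a condition until it holds or a timeout expires. It is plain Ruby and not part of scRUBYt!; the name wait_until and its parameters are made up for illustration, and the block passed to it merely stands in for a check such as "has the AJAX-loaded list appeared yet?".

```ruby
# Hypothetical helper (not a scRUBYt! method): call the given block
# repeatedly until it returns a truthy value or the timeout runs out.
def wait_until(timeout: 5, interval: 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in condition: pretend the list shows up on the third check.
checks = 0
wait_until(timeout: 2, interval: 0.1) do
  checks += 1
  checks >= 3
end
puts "list available after #{checks} checks"
```

The trade-off is the usual one: polling returns as soon as the content is there, while a fixed wait is simpler and needs no condition to test.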
Frames won’t stop us - Google Analytics
Besides being AJAXy, Google Analytics throws some more complexity into the mix: the login fields are in a frame, which is again not trivial to scrape. Fortunately scRUBYt! abstracts all that frame handling away and makes this really easy:
- data = Scrubyt::Extractor.define :agent => :firefox do
-   fetch 'https://www.google.com/analytics/reporting/login'
-   frame :name, "login"
-   fill_textfield 'Email', '*****'
-   fill_textfield 'Passwd', '*****'
-   submit_and_wait 5
-   pageviews "//div[@id='PageviewsSummary']//li[@class='item_value']", :example_type => :xpath
- end
- puts data.to_xml
All you have to do is ‘go into’ the frame named login. It looks like any other navigation step (and basically we can consider it one), after which the scraping is executed on the document in the frame. We again used an _and_wait method, since it takes some time until everything is loaded after logging in.
Scraping an AJAX pop-up
Technically this is not much different from the first scenario, but it’s interesting nevertheless. The task is to scrape the suggestions that Yahoo pops up after you enter something into the search field.
Here is the scraper:
- require 'rubygems'
- require 'scrubyt'
- require 'cgi'
- Scrubyt.logger = Scrubyt::Logger.new
- yahoo_data = Scrubyt::Extractor.define :agent => :firefox do
-   fetch 'http://www.yahoo.com'
-   fill_textfield_and_wait 'p', 'ruby', 5
-   suggestion_list "//div[@id='ac_container']//li/a", :example_type => :xpath do
-     href "href", :type => :attribute do
-       escaped_string /&p=(.+?)$/ do
-         suggestion lambda {|x| CGI::unescape(x)}, :type => :script
-       end
-     end
-   end
- end
- p yahoo_data.to_hash
The result:
- [{:suggestion=>"ruby tuesday"},
- {:suggestion=>"pokemon ruby"},
- {:suggestion=>"ruby bridges"},
- {:suggestion=>"max and ruby"},
- {:suggestion=>"ruby falls"},
- {:suggestion=>"ruby rippey tourk"},
- {:suggestion=>"ruby lane"},
- {:suggestion=>"pokemon ruby cheats"},
- {:suggestion=>"ruby skye"},
- {:suggestion=>"ruby lin"}]
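The post-processing inside the scraper (take the link’s href, cut out the part after &p= with a regular expression, then URL-decode it) can be tried in isolation with plain Ruby. This is only a sketch of that one step: the helper name suggestion_from_href and the sample hrefs are invented for illustration, and the real links Yahoo generates may look different.

```ruby
require 'cgi'

# Same pattern as in the scraper: capture everything after "&p=".
QUERY_PATTERN = /&p=(.+?)$/

# Hypothetical helper: returns the decoded suggestion text,
# or nil when the href contains no "&p=" parameter.
def suggestion_from_href(href)
  escaped = href[QUERY_PATTERN, 1]
  escaped && CGI.unescape(escaped)
end

# Made-up hrefs, only for demonstration.
hrefs = [
  'http://search.example.com/search?fr=sb&p=ruby+tuesday',
  'http://search.example.com/search?fr=sb&p=max%20and%20ruby'
]
hrefs.each { |href| puts suggestion_from_href(href) }
```

Note that the pattern expects the query to be introduced by &p=, so a link where p is the first parameter (?p=...) would not match.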
You can download the above (and other) examples from the scRUBYt! examples github repository:
git clone git://github.com/scrubber/scrubyt_examples.git
Please check out scRUBYt’s homepage for more info!

December 17th, 2008 at 8:04 am
very nice! This will definitely come in handy
December 17th, 2008 at 2:02 pm
This is actually very cool. Makes me want to pick up scRUBYt again. I’m definitely going to try this out.
December 18th, 2008 at 12:57 am
This is great, but unfortunately, scRUBYt! is too intimidating and has no documentation. For this reason me and many people I know are still kinda stuck on mechanize and Hpricot.
December 18th, 2008 at 1:43 am
Kokpit,
A few thoughts:
- We are working on better documentation all the time - for example the skimr branch (rewriting scRUBYt! from scratch - see http://rubypond.com/articles/2008/12/09/web-spidering-and-data-extraction-with-scrubyt/) is developed with BDD so you have the test descriptions for everything
- scRUBYt! is not Rails or something in that range - you can find out 90% of the functionality relatively quickly by digging around in the source code
- it’s open source software, so we are happy to accept documentation, suggestions or fixes to make it less intimidating - so why not contribute instead of complaining?
- almost every open source framework could have been ditched with the same reasoning before it got really popular. Fortunately there were enough people to help to reach the tipping point for those projects - this is not true for scRUBYt! yet (though there is a similar amount of bitching, so at least we are even in that area ;-))
Thoughts?
December 23rd, 2008 at 4:18 am
By the looks of it you’re writing in markdown, and have forgotten to escape the underscores in “We again used an _and_wait method”, so it’s rendered as andwait. Just need to sprinkle a couple of \’s in there.
February 8th, 2009 at 6:19 pm
This is a great example and will be very helpful on my current scraping project. Does anyone have any experience dealing with pop-up windows? I’m running into an error and was hoping someone may have already dealt with a similar situation.
I’m getting the following error when I try to select an option box on a pop-up window.
Error: Unable to locate element, using :name, "ddlslot1"
Here is the code:
Scrubyt.logger = Scrubyt::Logger.new
data = Scrubyt::Extractor.define :agent => :firefox do
  fetch 'http://www.tvg.com'
  sleep 1
  fill_textfield 'Login1$txtAccountNumber', '142883'
  fill_textfield 'Login1$txtPIN', '8240'
  select_option 'Login1$ddlState', 'Massachusetts'
  click_by_xpath('/html/body/form/div[2]/div/div/div/div/div[2]/div/table/tbody/tr[5]/td[2]/input')
  sleep 5
  fetch '/Authenticated/program/default.aspx'
  # Click image to spawn a pop-up window that allows you to change the tracks selected on current page.
  click_by_xpath('/html/body/form/div[2]/div/div[2]/div[2]/div/div[2]/div[7]/img')
  # Select Dubai as a track. Error occurs here where I try to make a selection on the pop-up window.
  select_option 'ddlSlot1', 'Dubai'
  # Once the selection(s) are made, push this button to confirm the selection and return to main page.
  click_by_path('/html/body/p/a')
  results "//body"
end
It seems like the focus is not on the pop-up, so I am not able to make the selection on the selection box. Any help would be appreciated.
Thanks,
Jim
February 11th, 2009 at 12:47 am
Jim,
I am not sure whether you have this in your actual script too, but here you have
click_by_path('/html/body/p/a')
(and not click_by_xpath).
Does that help?
If not, please post this to the ML; it’s too messy to answer all over the place (PM, mailing list, Lighthouse, blog comments).
July 13th, 2009 at 10:45 pm
When was this exactly?
September 8th, 2009 at 6:35 am
Very impressed with ScRUBYt!
It would be great to have an ‘at a glance’ overview of advantages over Hpricot, Mechanize, ScrAPI, Watir etc.
September 11th, 2009 at 3:49 am
It’s really great... hope it will be implemented well.
October 13th, 2009 at 3:38 am
[...] a slew of different libraries which mimic browsing behavior and also handle ajax. Off the cuff, ScrubyT handles ajax quite well. Watir can actually open an instance of IE, FF, Safari, or even Chrome, and [...]
April 9th, 2010 at 10:03 pm
Hey, I see all your blogs, keep them coming.