
As announced on the scRUBYt! blog, there is a brand new release of scRUBYt! which, among other additions, enables AJAX scraping. I’d like to present a few examples of kicking the data out of non-trivial-to-scrape pages: LinkedIn, Google Analytics and Yahoo (which in itself is not a big deal - unless you want to scrape the suggestions that pop up after you enter a keyword into the search field).

Without further ado, let’s get down to business!

LinkedIn

Let’s say you’d like to scrape your LinkedIn contact list - first name, last name and e-mail of every contact you have. What makes this task complicated (but not for scRUBYt!) is that the contact list is inserted with AJAX after the page is loaded into the browser, and thus it is ‘invisible’ to a standard HTML parser like Hpricot/Nokogiri, so don’t try with those. Instead, check out how you might do it with scRUBYt!:

  1. property_data = Scrubyt::Extractor.define :agent => :firefox do
  2.   fetch          'https://www.linkedin.com/secure/login'
  3.   fill_textfield 'session_key', '*****'
  4.   fill_textfield 'session_password', '*****'
  5.   submit
  6.
  7.   click_link_and_wait 'Connections', 5
  8.
  9.   vcard "//li[@class='vcard']" do
  10.     first_name  "//span[@class='given-name']"
  11.     second_name "//span[@class='family-name']"
  12.     email       "//a[@class='email']"
  13.   end
  14. end
  15.
  16. puts property_data.to_xml

For the records in this screenshot:

[Screenshot: linkedin.png]

the result is the following:

    <vcard>
      <first_name>Alex</first_name>
      <second_name>Combas</second_name>
      <email>*** alex’s e-mail ***</email>
    </vcard>
    <vcard>
      <first_name>Peter</first_name>
      <second_name>Cooper</second_name>
      <email>*** peter’s e-mail ***</email>
    </vcard>
    <vcard>
      <first_name>Jim</first_name>
      <second_name>Cropcho</second_name>
      <email>*** jim’s e-mail ***</email>
    </vcard>

The magic happens on line 7: click_link_and_wait clicks the ‘Connections’ link and then waits 5 seconds, which gives the list time to load with AJAX. After that you can scrape the contacts as you normally would.
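If you want to try the nested patterns offline, here is a minimal sketch that applies the same XPath expressions to a small hand-written fragment using REXML from Ruby's standard distribution. This is not how scRUBYt! works internally. The sample markup and the example.com addresses are made up for illustration, and scoping each inner lookup to the current li with './/' is my assumption about how the nested patterns behave, based on the output above.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the markup inserted by the AJAX call.
# The structure and the e-mail addresses are invented for this sketch.
SAMPLE_MARKUP = <<~HTML
  <ul>
    <li class="vcard">
      <span class="given-name">Alex</span>
      <span class="family-name">Combas</span>
      <a class="email" href="mailto:alex@example.com">alex@example.com</a>
    </li>
    <li class="vcard">
      <span class="given-name">Peter</span>
      <span class="family-name">Cooper</span>
      <a class="email" href="mailto:peter@example.com">peter@example.com</a>
    </li>
  </ul>
HTML

# Returns one hash per vcard element found in the markup.
def extract_vcards(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//li[@class='vcard']").map do |card|
    {
      # './/' keeps each lookup inside the current vcard element
      first_name:  REXML::XPath.first(card, ".//span[@class='given-name']").text,
      second_name: REXML::XPath.first(card, ".//span[@class='family-name']").text,
      email:       REXML::XPath.first(card, ".//a[@class='email']").text
    }
  end
end

extract_vcards(SAMPLE_MARKUP).each { |card| p card }
```

The point of the sketch is only the shape of the extraction: an outer pattern selects each record, and the inner patterns pick the fields out of that record.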

Frames won’t stop us - Google Analytics

Besides being AJAXy, Google Analytics throws some more complexity into the mix: the login fields are in a frame, which is again not trivial to scrape. Fortunately scRUBYt! abstracts all that frame handling away and makes this really easy:

    data = Scrubyt::Extractor.define :agent => :firefox do
      fetch 'https://www.google.com/analytics/reporting/login'
      # step into the frame named "login" before filling the form
      frame :name, "login"

      fill_textfield 'Email', '*****'
      fill_textfield 'Passwd', '*****'
      # submit the form, then wait 5 seconds for the report page to load
      submit_and_wait 5

      pageviews "//div[@id='PageviewsSummary']//li[@class='item_value']", :example_type => :xpath
    end

    puts data.to_xml

All you have to do is ‘go into’ the frame named login. It looks like any other navigation step (and basically we can consider it one), after which the scraping is executed on the document in the frame. We again used an _and_wait method - it takes some time until everything is loaded after logging in.
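The _and_wait methods above take a fixed number of seconds. For comparison, here is a sketch of the alternative idea of polling until a condition holds, in plain Ruby. The wait_until helper is hypothetical and is not part of scRUBYt!; the "page" is simulated with a timer so the example runs on its own.

```ruby
# Hypothetical helper, not part of scRUBYt!.
# Calls the block repeatedly until it returns a truthy value or the
# timeout expires. Returns the block's value, or raises on timeout.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Simulated page whose content becomes available about 0.3 seconds from now.
ready_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.3
page_loaded = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) >= ready_at }

wait_until(timeout: 2) { page_loaded.call }
puts "content is available"
```

A fixed wait is simpler and is what the examples in this post use; polling returns as soon as the content shows up and fails loudly when it never does.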

Scraping an AJAX pop-up

Technically this is not much different from the first scenario, but it’s interesting nevertheless. The task is to scrape the suggestions that Yahoo pops up after you enter something into the search field:

[Screenshot: yahoo.png]

Here is the scraper:

    require 'rubygems'
    require 'scrubyt'
    require 'cgi'

    Scrubyt.logger = Scrubyt::Logger.new

    yahoo_data = Scrubyt::Extractor.define :agent => :firefox do
      fetch 'http://www.yahoo.com'
      # type 'ruby' into the search field ('p'), then wait 5 seconds for the pop-up
      fill_textfield_and_wait 'p', 'ruby', 5

      suggestion_list "//div[@id='ac_container']//li/a", :example_type => :xpath do
        href "href", :type => :attribute do
          # the part of the link after &p= is the URL-escaped suggestion
          escaped_string /&p=(.+?)$/ do
            suggestion lambda {|x| CGI::unescape(x)}, :type => :script
          end
        end
      end
    end

    p yahoo_data.to_hash

The result:

    [{:suggestion=>"ruby tuesday"},
     {:suggestion=>"pokemon ruby"},
     {:suggestion=>"ruby bridges"},
     {:suggestion=>"max and ruby"},
     {:suggestion=>"ruby falls"},
     {:suggestion=>"ruby rippey tourk"},
     {:suggestion=>"ruby lane"},
     {:suggestion=>"pokemon ruby cheats"},
     {:suggestion=>"ruby skye"},
     {:suggestion=>"ruby lin"}]
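The innermost two patterns of the scraper (the regular expression and the CGI::unescape lambda) can be tried on their own in plain Ruby. The sketch below reuses the same regular expression. The suggestion_from_href helper and the href values are made up for illustration, so real Yahoo links may look different.

```ruby
require 'cgi'

# Same regular expression as the escaped_string pattern in the scraper.
SUGGESTION_PARAM = /&p=(.+?)$/

# Hypothetical helper: returns the decoded search term from a suggestion
# link, or nil if the link has no trailing &p= parameter.
def suggestion_from_href(href)
  match = SUGGESTION_PARAM.match(href)
  match && CGI.unescape(match[1])
end

# Invented href values, only for demonstration.
hrefs = [
  "http://search.yahoo.com/search?fr=example&p=ruby+tuesday",
  "http://search.yahoo.com/search?fr=example&p=max+and+ruby"
]

hrefs.each { |href| p suggestion_from_href(href) }
```

Note that the expression only matches when p is not the first parameter (it looks for &p=, not ?p=) and when it is the last one on the line, which was apparently true for the links in the pop-up.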

You can download the above (and other) examples from the scRUBYt! examples github repository:

git clone git://github.com/scrubber/scrubyt_examples.git

Please check out scRUBYt’s homepage for more info!





      

17 Responses to “AJAX Scraping with scRUBYt! - LinkedIn, Google Analytics, Yahoo Suggestions”

  1. Jeff Says:

    very nice! This will definitely come in handy

  2. Koen Van der Auwera Says:

    This is actually very cool. Makes me want to pick up scRUBYt again. I’m definitely going to try this out.

  3. Kokpit Says:

    This is great, but unfortunately, scRUBYt! is too intimidating and has no documentation. For this reason me and many people I know are still kinda stuck on Mechanize and Hpricot.

  4. Peter Says:

    Kokpit,

    A few thoughts:
    - We are working on better documentation all the time - for example the skimr branch (rewriting scRUBYt! from scratch - see http://rubypond.com/articles/2008/12/09/web-spidering-and-data-extraction-with-scrubyt/) is developed with BDD so you have the test descriptions for everything
    - scRUBYt! is not Rails or something in that range - you can find out 90% of the functionality relatively quickly by digging around in the source code
    - it’s open source software, so we are happy to accept documentation, suggestions or fixes to make it less intimidating - so why not contribute instead of complaining?
    - almost every open source framework could have been ditched with the same reasoning before it got really popular. Fortunately there were enough people to help to reach the tipping point for those projects - this is not true for scRUBYt! yet (though there is a similar amount of bitching, so at least we are even in that area ;-))

    Thoughts?

  5. Caius Says:

    By the looks of it you’re writing in markdown, and have forgotten to escape the underscores in “We again used an _and_wait method”, so it’s rendered as “andwait”. Just need to sprinkle a couple of \’s in there :)

  6. Jim Says:

    This is a great example and will be very helpful on my current scraping project. Does anyone have any experience dealing with pop-up windows? I’m running into an error and was hoping someone may have already dealt with a similar situation.

    I’m getting the following error when I try to select an option box on a pop-up window.
    Error: Unable to locate element, using :name, "ddlslot1"

    Here is the code:

        Scrubyt.logger = Scrubyt::Logger.new
        data = Scrubyt::Extractor.define :agent => :firefox do
          fetch 'http://www.tvg.com'
          sleep 1
          fill_textfield 'Login1$txtAccountNumber', '142883'
          fill_textfield 'Login1$txtPIN', '8240'
          select_option 'Login1$ddlState', 'Massachusetts'
          click_by_xpath('/html/body/form/div[2]/div/div/div/div/div[2]/div/table/tbody/tr[5]/td[2]/input')
          sleep 5
          fetch '/Authenticated/program/default.aspx'
          # click image to spawn a pop-up window that allows you to change the tracks selected on current page.
          click_by_xpath('/html/body/form/div[2]/div/div[2]/div[2]/div/div[2]/div[7]/img')
          # Select Dubai as a track
          # Error occurs here where I try to make a selection on the pop-up window
          select_option 'ddlSlot1', 'Dubai'
          # once the selection(s) are made push this button to confirm the selection and return to main page.
          click_by_path('/html/body/p/a')
          results "//body"

    It seems like the focus is not on the pop-up so I am not able to make the selection on the selection box. Any help would be appreciated.
    Thanks,
    Jim

  7. Peter Says:

    Jim,

    I am not sure you have this in your actual script too, but here you have

    click_by_path('/html/body/p/a')

    (and not click_by_xpath)

    does that help?

    If not, please post this to the ML, it’s too messy to answer all over the place (PM, mailing list, Lighthouse, blog comments) :-)

  8. anya183 likes undies Says:

    When was this exactly?

  9. Martin Chamberlain Says:

    Very impressed with ScRUBYt!

    It would be great to have an ‘at a glance’ overview of advantages over Hpricot, Mechanize, ScrAPI, Watir etc.

  10. DCKAP Says:

    It’s really great. Hope it will be implemented well.

  11. Will Google stop SERP scrapers by going Ajax? | I Can Has Rankings? Says:

    [...] a slew of different libraries which mimic browsing behavior and also handle ajax. Off the cuff, ScrubyT handles ajax quite well. Watir can actually open an instance of IE, FF, Safari, or even Chrome, and [...]

