AJAX Scraping with scRUBYt! – LinkedIn, Google Analytics, Yahoo Suggestions

As announced on the scRUBYt! blog, there is a brand new release of scRUBYt! which (among other additions) enables AJAX scraping. I’d like to present a few examples of kicking the data out of non-trivial-to-scrape pages: LinkedIn, Google Analytics and Yahoo (which in itself is not a big deal – unless you want to scrape the suggestions that pop up after entering a keyword into the search text field).

Without further ado, let’s get down to business!

LinkedIn

Let’s say you’d like to scrape your LinkedIn contact list – the first name, last name and e-mail of every contact you have. What makes this task complicated (but not for scRUBYt!) is that the contact list is inserted with AJAX after the page is loaded into the browser, and is thus ‘invisible’ to a standard HTML parser like Hpricot/Nokogiri, so don’t even try with those. Instead, check out how you might do it with scRUBYt!:

property_data = Scrubyt::Extractor.define :agent => :firefox do
  fetch          'https://www.linkedin.com/secure/login'
  fill_textfield 'session_key', '*****'
  fill_textfield 'session_password', '*****'
  submit

  click_link_and_wait 'Connections', 5

  vcard "//li[@class='vcard']" do
    first_name  "//span[@class='given-name']"
    second_name "//span[@class='family-name']"
    email       "//a[@class='email']"
  end
end

puts property_data.to_xml

The result for these records:

linkedin.png

is the following:

  <vcard>
    <first_name>Alex</first_name>
    <second_name>Combas</second_name>
    <email>*** alex's email ***</email>
  </vcard>
  <vcard>
    <first_name>Peter</first_name>
    <second_name>Cooper</second_name>
    <email>*** peter's e-mail ***</email>
  </vcard>
  <vcard>
    <first_name>Jim</first_name>
    <second_name>Cropcho</second_name>
    <email>*** jim's e-mail ***</email>
  </vcard>
The magic happens on line 7: you click the ‘Connections’ link and wait 5 seconds, until the list is loaded with AJAX. Then you can scrape the contacts as you normally would.
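All the _and_wait variants follow the same simple idea: run the navigation step, then pause so the AJAX-inserted content has time to arrive. Here is a minimal sketch of that idea in plain Ruby (not scRUBYt!’s actual implementation, just an illustration):

```ruby
# Minimal sketch of the _and_wait idea (NOT scRUBYt!'s real code):
# perform a navigation action, then sleep so AJAX content can load.
def and_wait(seconds)
  yield           # the navigation step: click a link, submit a form, ...
  sleep seconds   # naive fixed wait; polling for an element is more robust
end

and_wait(1) { puts "clicked 'Connections'" }
```

A fixed sleep is crude but predictable – that’s why the wait (5 seconds above) is passed in explicitly rather than guessed.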

Frames won’t stop us – Google Analytics

Besides being AJAXy, Google Analytics throws some more complexity into the mix: the login fields are in a frame, which again is not trivial to scrape – fortunately scRUBYt! abstracts all the frame handling away and makes this really easy:

data = Scrubyt::Extractor.define :agent => :firefox do
  fetch 'https://www.google.com/analytics/reporting/login'
  frame :name, "login"

  fill_textfield 'Email', '*****'
  fill_textfield 'Passwd', '*****'
  submit_and_wait 5

  pageviews "//div[@id='PageviewsSummary']//li[@class='item_value']", :example_type => :xpath
end

puts data.to_xml

All you had to do was ‘go into’ the frame named login. It looks like any other navigation step (and basically we can consider it one), after which the scraping is executed on the document inside the frame.
We again used an _and_wait method – it takes some time until everything is loaded after logging in.
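Conceptually, the frame step just moves the ‘current document’ pointer, which is why it composes with all the other steps. A toy model of that (purely hypothetical, not scRUBYt!’s internals):

```ruby
# Toy model (not scRUBYt!'s internals) of why `frame` acts like navigation:
# it swaps the current document for the frame's own document, so every
# later step (fill_textfield, XPath patterns, ...) runs against the frame.
class ToyBrowser
  attr_reader :current_doc

  def initialize(main_doc, frames)
    @current_doc = main_doc
    @frames = frames                    # e.g. { "login" => "<frame doc>" }
  end

  def frame(name)
    @current_doc = @frames.fetch(name)  # all subsequent steps see this doc
  end
end

browser = ToyBrowser.new("<main page>", "login" => "<login frame>")
browser.frame("login")
puts browser.current_doc
```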

Scraping an AJAX pop-up

Technically this is not much different from the first scenario, but it’s interesting nevertheless. The task is to scrape the suggestions that Yahoo pops up after you enter something into the search field:


yahoo.png

Here is the scraper:

require 'rubygems'
require 'scrubyt'
require 'cgi'

Scrubyt.logger = Scrubyt::Logger.new


yahoo_data = Scrubyt::Extractor.define :agent => :firefox do
  fetch 'http://www.yahoo.com'
  fill_textfield_and_wait 'p', 'ruby', 5

  suggestion_list "//div[@id='ac_container']//li/a", :example_type => :xpath do
    href "href", :type => :attribute do
      escaped_string /&p=(.+?)$/ do
        suggestion lambda {|x| CGI::unescape(x)}, :type => :script
      end
    end
  end
end

p yahoo_data.to_hash

The result:

[{:suggestion=>"ruby tuesday"}, 
 {:suggestion=>"pokemon ruby"},
 {:suggestion=>"ruby bridges"},
 {:suggestion=>"max and ruby"},
 {:suggestion=>"ruby falls"},
 {:suggestion=>"ruby rippey tourk"}, 
 {:suggestion=>"ruby lane"},
 {:suggestion=>"pokemon ruby cheats"},
 {:suggestion=>"ruby skye"},
 {:suggestion=>"ruby lin"}]
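The nested pattern in the extractor (href attribute → regex capture → lambda) can be replayed in plain Ruby. The href below is a made-up example of what such a suggestion link might contain – the real markup may differ:

```ruby
require 'cgi'

# Hypothetical suggestion link for illustration purposes.
href = '/search;_ylt=abc?fr=ush-web&p=pokemon+ruby+cheats'

# Same steps as the extractor: capture the query with /&p=(.+?)$/,
# then CGI-unescape it to get the human-readable suggestion.
if href =~ /&p=(.+?)$/
  suggestion = CGI.unescape($1)
  puts suggestion   # => pokemon ruby cheats
end
```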

You can download the above (and other) examples from the scRUBYt! examples github repository:

git clone git://github.com/scrubber/scrubyt_examples.git

Please check out scRUBYt’s homepage for more info!

30 thoughts on “AJAX Scraping with scRUBYt! – LinkedIn, Google Analytics, Yahoo Suggestions”

  1. This is great, but unfortunately, scRUBYt! is too intimidating and has no documentation. For this reason me and many people I know are still kinda stuck on mechanize and Hpricot.

  2. Kokpit,

    A few thoughts:
    - We are working on better documentation all the time – for example the skimr branch (rewriting scRUBYt! from scratch – see http://rubypond.com/articles/2008/12/09/web-spidering-and-data-extraction-with-scrubyt/) is developed with BDD so you have the test descriptions for everything
    - scRUBYt! is not Rails or something in that range – you can find out 90% of the functionality relatively quickly by digging around in the source code
    - it’s open source software, so we are happy to accept documentation, suggestions or fixes to make it less intimidating – so why not contribute instead of complaining?
    - almost every open source framework could have been ditched with the same reasoning before it got really popular. Fortunately there were enough people to help to reach the tipping point for those projects – this is not true for scRUBYt! yet (though there is a similar amount of bitching, so at least we are even in that area ;-) )

    Thoughts?

  3. By the looks of it you’re writing in markdown, and have forgotten to escape the underscores in We again used an _and_wait method so its rendered it as andwait. Just need to sprinkle a couple of \’s in there :)

  4. This is a great example and will be very helpful on my current scraping project. Does anyone have any experience dealing with pop-up windows? I’m running into an error and was hoping someone may have already dealt with a similar situation.

     I’m getting the following error when I try to select an option box on
     a pop-up window.
     Error: Unable to locate element, using :name, "ddlslot1"

     Here is the code:

     Scrubyt.logger = Scrubyt::Logger.new
     data = Scrubyt::Extractor.define :agent => :firefox do
       fetch 'http://www.tvg.com'
       sleep 1
       fill_textfield 'Login1$txtAccountNumber', '142883'
       fill_textfield 'Login1$txtPIN', '8240'
       select_option 'Login1$ddlState', 'Massachusetts'
       click_by_xpath('/html/body/form/div[2]/div/div/div/div/div[2]/div/table/tbody/tr[5]/td[2]/input')
       sleep 5
       fetch '/Authenticated/program/default.aspx'
       # click image to spawn a pop-up window that allows you to change the
       # tracks selected on the current page
       click_by_xpath('/html/body/form/div[2]/div/div[2]/div[2]/div/div[2]/div[7]/img')
       # select Dubai as a track - the error occurs here, when I try to make
       # a selection on the pop-up window
       select_option 'ddlSlot1', 'Dubai'
       # once the selection(s) are made, push this button to confirm the
       # selection and return to the main page
       click_by_path('/html/body/p/a')
       results "//body"
     end

     It seems like the focus is not on the pop-up so I am not able to make
     the selection on the selection box. Any help would be appreciated.
     Thanks,
     Jim

  5. Jim,

     I am not sure you have this in your actual script too, but here you have

     click_by_path('/html/body/p/a')

     (and not click_by_xpath)

     Does that help?

     If not, please post this to the mailing list – it’s too messy to answer all over the place (PM, mailing list, Lighthouse, blog comments :-)



  11. So I was left with the question as to whether true value is derived from
    the fact that in addition to enabling access to your friends (called contacts
    in most conventional tools but connections in LinkedIn terminology) the
    user gains visibility to the friend’s friends (second degree connections),
    and somewhat more limited knowledge of the friends’s friends’s friends
    (third degree connections). This entire network of first, second, and third
    degree connections is much larger than the set of friends (usually by a
    factor of between 1000 to 10,000), and has recently been the subject of
    interest. In order to process information in such networks, various machine
    readable formats for describing a FOAF (Friend Of A Friend) have been
    developed…

