AJAX Scraping with scRUBYt! – LinkedIn, Google Analytics, Yahoo Suggestions

As announced on the scRUBYt! blog, there is a brand new release of scRUBYt! which (among other additions) enables AJAX scraping. I’d like to present a few examples of kicking the data out of non-trivial-to-scrape pages: LinkedIn, Google Analytics and Yahoo (which in itself is not a big deal – unless you want to scrape the suggestions that pop up after entering a keyword into the search text field).

Without further ado, let’s get down to business!

LinkedIn

Let’s say you’d like to scrape your LinkedIn contact list – the first name, last name and e-mail of every contact you have. What makes this task complicated (but not for scRUBYt!) is that the contact list is inserted with AJAX after the page is loaded into the browser, and is thus ‘invisible’ to a standard HTML parser like Hpricot/Nokogiri, so don’t even try with those. Instead, check out how you might do it with scRUBYt!:

property_data = Scrubyt::Extractor.define :agent => :firefox do
  fetch          'https://www.linkedin.com/secure/login'
  fill_textfield 'session_key', '*****'
  fill_textfield 'session_password', '*****'
  submit

  click_link_and_wait 'Connections', 5

  vcard "//li[@class='vcard']" do
    first_name  "//span[@class='given-name']"
    second_name "//span[@class='family-name']"
    email       "//a[@class='email']"
  end
end

puts property_data.to_xml

The result for these records:

linkedin.png

is the following:

  <vcard>
    <first_name>Alex</first_name>
    <second_name>Combas</second_name>
    <email>*** alex's email ***</email>
  </vcard>
  <vcard>
    <first_name>Peter</first_name>
    <second_name>Cooper</second_name>
    <email>*** peter's e-mail ***</email>
  </vcard>
  <vcard>
    <first_name>Jim</first_name>
    <second_name>Cropcho</second_name>
    <email>*** jim's e-mail ***</email>
  </vcard>
The magic happens on line 7: you click the ‘Connections’ link and wait 5 seconds, until the list is loaded with AJAX. Then you can scrape the contacts as you normally would.
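All the _and_wait variants follow the same simple idea: run the navigation step, then pause so the AJAX-inserted content has time to arrive. Here is a minimal sketch of that idea in plain Ruby (not scRUBYt!’s actual implementation, just an illustration):

```ruby
# Minimal sketch of the _and_wait idea (NOT scRUBYt!'s real code):
# perform a navigation action, then sleep so AJAX content can load.
def and_wait(seconds)
  yield           # the navigation step: click a link, submit a form, ...
  sleep seconds   # naive fixed wait; polling for an element is more robust
end

and_wait(1) { puts "clicked 'Connections'" }
```

A fixed sleep is crude but predictable – that’s why the wait (5 seconds above) is passed in explicitly rather than guessed.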

Frames won’t stop us – Google Analytics

Besides being AJAXy, Google Analytics throws some more complexity into the mix: the login fields are in a frame, which again is not trivial to scrape – fortunately scRUBYt! abstracts all the frame handling away and makes this really easy:

data = Scrubyt::Extractor.define :agent => :firefox do
  fetch 'https://www.google.com/analytics/reporting/login'
  frame :name, "login"

  fill_textfield 'Email', '*****'
  fill_textfield 'Passwd', '*****'
  submit_and_wait 5

  pageviews "//div[@id='PageviewsSummary']//li[@class='item_value']", :example_type => :xpath
end

puts data.to_xml

All you had to do was ‘go into’ the frame named login. It looks like any other navigation step (and basically we can consider it one), after which the scraping is executed on the document inside the frame.
We again used an _and_wait method – it takes some time until everything is loaded after logging in.
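Conceptually, the frame step just moves the ‘current document’ pointer, which is why it composes with all the other steps. A toy model of that (purely hypothetical, not scRUBYt!’s internals):

```ruby
# Toy model (not scRUBYt!'s internals) of why `frame` acts like navigation:
# it swaps the current document for the frame's own document, so every
# later step (fill_textfield, XPath patterns, ...) runs against the frame.
class ToyBrowser
  attr_reader :current_doc

  def initialize(main_doc, frames)
    @current_doc = main_doc
    @frames = frames                    # e.g. { "login" => "<frame doc>" }
  end

  def frame(name)
    @current_doc = @frames.fetch(name)  # all subsequent steps see this doc
  end
end

browser = ToyBrowser.new("<main page>", "login" => "<login frame>")
browser.frame("login")
puts browser.current_doc
```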

Scraping an AJAX pop-up

Technically this is not much different from the first scenario, but it’s interesting nevertheless. The task is to scrape the suggestions that Yahoo pops up after you enter something into the search field:


yahoo.png

Here is the scraper:

require 'rubygems'
require 'scrubyt'
require 'cgi'

Scrubyt.logger = Scrubyt::Logger.new


yahoo_data = Scrubyt::Extractor.define :agent => :firefox do
  fetch 'http://www.yahoo.com'
  fill_textfield_and_wait 'p', 'ruby', 5

  suggestion_list "//div[@id='ac_container']//li/a", :example_type => :xpath do
    href "href", :type => :attribute do
      escaped_string /&p=(.+?)$/ do
        suggestion lambda {|x| CGI::unescape(x)}, :type => :script
      end
    end
  end
end

p yahoo_data.to_hash

The result:

[{:suggestion=>"ruby tuesday"}, 
 {:suggestion=>"pokemon ruby"},
 {:suggestion=>"ruby bridges"},
 {:suggestion=>"max and ruby"},
 {:suggestion=>"ruby falls"},
 {:suggestion=>"ruby rippey tourk"}, 
 {:suggestion=>"ruby lane"},
 {:suggestion=>"pokemon ruby cheats"},
 {:suggestion=>"ruby skye"},
 {:suggestion=>"ruby lin"}]
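The nested pattern in the extractor (href attribute → regex capture → lambda) can be replayed in plain Ruby. The href below is a made-up example of what such a suggestion link might contain – the real markup may differ:

```ruby
require 'cgi'

# Hypothetical suggestion link for illustration purposes.
href = '/search;_ylt=abc?fr=ush-web&p=pokemon+ruby+cheats'

# Same steps as the extractor: capture the query with /&p=(.+?)$/,
# then CGI-unescape it to get the human-readable suggestion.
if href =~ /&p=(.+?)$/
  suggestion = CGI.unescape($1)
  puts suggestion   # => pokemon ruby cheats
end
```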

You can download the above (and other) examples from the scRUBYt! examples github repository:

git clone git://github.com/scrubber/scrubyt_examples.git

Please check out scRUBYt’s homepage for more info!

30 thoughts on “AJAX Scraping with scRUBYt! – LinkedIn, Google Analytics, Yahoo Suggestions”

  1. This is great, but unfortunately, scRUBYt! is too intimidating and has no documentation. For this reason me and many people I know are still kinda stuck on mechanize and Hpricot.

  2. Kokpit,

    A few thoughts:
    - We are working on better documentation all the time – for example the skimr branch (rewriting scRUBYt! from scratch – see http://rubypond.com/articles/2008/12/09/web-spidering-and-data-extraction-with-scrubyt/) is developed with BDD so you have the test descriptions for everything
    - scRUBYt! is not Rails or something in that range – you can find out 90% of the functionality relatively quickly by digging around in the source code
    - it’s open source software, so we are happy to accept documentation, suggestions or fixes to make it less intimidating – so why not contribute instead of complaining?
    - almost every open source framework could have been ditched with the same reasoning before it got really popular. Fortunately there were enough people to help to reach the tipping point for those projects – this is not true for scRUBYt! yet (though there is a similar amount of bitching, so at least we are even in that area ;-) )

    Thoughts?

  3. By the looks of it you’re writing in markdown, and have forgotten to escape the underscores in We again used an _and_wait method so its rendered it as andwait. Just need to sprinkle a couple of \’s in there :)

  4. This is a great example and will be very helpful on my current scraping project. Does anyone have any experience dealing with pop-up windows? I’m running into an error and was hoping someone may have already dealt with a similar situation.

     I’m getting the following error when I try to select an option box on
     a pop-up window.
     Error: Unable to locate element, using :name, "ddlslot1"

     Here is the code:

     Scrubyt.logger = Scrubyt::Logger.new
     data = Scrubyt::Extractor.define :agent => :firefox do
       fetch 'http://www.tvg.com'
       sleep 1
       fill_textfield 'Login1$txtAccountNumber', '142883'
       fill_textfield 'Login1$txtPIN', '8240'
       select_option 'Login1$ddlState', 'Massachusetts'
       click_by_xpath('/html/body/form/div[2]/div/div/div/div/div[2]/div/table/tbody/tr[5]/td[2]/input')
       sleep 5
       fetch '/Authenticated/program/default.aspx'
       # click image to spawn a pop-up window that allows you to change the
       # tracks selected on the current page
       click_by_xpath('/html/body/form/div[2]/div/div[2]/div[2]/div/div[2]/div[7]/img')
       # select Dubai as a track - the error occurs here, when I try to make
       # a selection on the pop-up window
       select_option 'ddlSlot1', 'Dubai'
       # once the selection(s) are made, push this button to confirm the
       # selection and return to the main page
       click_by_path('/html/body/p/a')
       results "//body"
     end

     It seems like the focus is not on the pop-up so I am not able to make
     the selection on the selection box. Any help would be appreciated.
     Thanks,
     Jim

  5. Jim,

     I am not sure you have this in your actual script too, but here you have

     click_by_path('/html/body/p/a')

     (and not click_by_xpath)

     Does that help?

     If not, please post this to the mailing list – it’s too messy to answer all over the place (PM, mailing list, Lighthouse, blog comments :-)



  11. So I was left with the question as to whether true value is derived from
    the fact that in addition to enabling access to your friends (called contacts
    in most conventional tools but connections in LinkedIn terminology) the
    user gains visibility to the friend’s friends (second degree connections),
    and somewhat more limited knowledge of the friends’s friends’s friends
    (third degree connections). This entire network of first, second, and third
    degree connections is much larger than the set of friends (usually by a
    factor of between 1000 to 10,000), and has recently been the subject of
    interest. In order to process information in such networks, various machine
    readable formats for describing a FOAF (Friend Of A Friend) have been
    developed…

