HTML Scraping with scRUBYt! for Fun and Profit

The Ruby Advent season is almost over, but there is still something cool to read for the remaining 3 days – so be sure to check out my second contribution to Ruby Advent, HTML Scraping with scRUBYt! for Fun and Profit, as well. It’s basically a quick, from-scratch introduction to scRUBYt! which does not go very deep, but gives you an idea of web scraping if you’d like to get your feet wet!

Lite Fixtures – The Last Nail in the standard Fixtures’ Coffin


We all know that standard Rails fixtures suck – they were good at the time of their invention, mainly because nothing better existed – but numerous weaknesses and problems have been mentioned since then.

After some time people addressed the pain points with various approaches – factory_girl, object_daddy, machinist, you name it.

While the above solutions are really great, they are quite different from standard fixtures in their philosophy, and I believe that is why some people just kept rolling with the standard solution. Well, there is a new player in town, so no more excuses.

Lite Fixtures are backward compatible with normal fixtures, while drastically reducing their complexity. Check out these examples from the GitHub README:

Patterns in fixture names – Using the pattern “(owners)s_(color)_(make)” in conjunction with the fixture name “Freds_red_Ford” unpacks owner to Fred, color to red and make to Ford.

 
      (owners)s_(color)_(make)
 
        Freds_red_Ford:
          year: 1977
          
        Eds_blue_Chevy:
          year: 1987
      
        
    Becomes
    
        Freds_red_Ford:
          owner: Fred
          color: red
          make:  Ford
          year:  1977
 
        Eds_blue_Chevy:
          owner: Ed
          color: blue
          make:  Chevy
          year: 1987
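To make the unpacking concrete, here is a minimal Ruby sketch of how such a name pattern could be expanded. It is not the Lite Fixtures implementation: the helper names pattern_to_regexp and expand_fixtures are made up for illustration, and the sketch assumes the capture group is spelled (owner) so that the key matches the owner: attribute in the expanded output above.

```ruby
# Hypothetical helper: turn "(owner)s_(color)_(make)" into a regexp
# with one named capture per parenthesised word.
def pattern_to_regexp(pattern)
  source = pattern.split(/(\(\w+\))/).map do |part|
    if part =~ /\A\((\w+)\)\z/
      "(?<#{$1}>[^_]+?)"      # a capture never spans an underscore
    else
      Regexp.escape(part)     # literal text such as "s_" or "_"
    end
  end.join
  Regexp.new("\\A#{source}\\z")
end

# Hypothetical helper: merge the values unpacked from each fixture name
# into that fixture's attributes (explicit attributes win on conflict).
def expand_fixtures(pattern, fixtures)
  regexp = pattern_to_regexp(pattern)
  fixtures.each_with_object({}) do |(name, attrs), expanded|
    match = regexp.match(name)
    unpacked = match ? match.names.zip(match.captures).to_h : {}
    expanded[name] = unpacked.merge(attrs)
  end
end

fixtures = {
  'Freds_red_Ford' => { 'year' => 1977 },
  'Eds_blue_Chevy' => { 'year' => 1987 }
}

expanded = expand_fixtures('(owner)s_(color)_(make)', fixtures)
p expanded['Freds_red_Ford']
```

Fixture names that do not match the pattern are passed through untouched in this sketch; the real library may well behave differently.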

Grouping of Data – Fixtures often group cleanly: a family of users, manufacturers of cars, etc. Lite Fixtures lets you nest data, so scoped values are propagated inward.

    red_fords:
      make: ford
      color: red
 
      mustang:
        owner: freddy
 
      taurus:
        owner: freddy
          
    Becomes:
    
      mustang:
        make: ford
        color: red
        owner: freddy
 
      taurus:
        make: ford
        color: red
        owner: freddy
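The inward propagation can be sketched in a few lines of Ruby. Again, this is only an illustration of the idea, not the library's code: flatten_groups is a hypothetical name, and the sketch assumes that any hash containing nested hashes is a group while a hash of plain values is a record.

```ruby
# Hypothetical helper: push a group's scalar values down into the
# records nested inside it, and return only the flattened records.
def flatten_groups(fixtures, inherited = {})
  fixtures.each_with_object({}) do |(name, value), records|
    next unless value.is_a?(Hash)
    scoped   = inherited.merge(value.reject { |_, v| v.is_a?(Hash) })
    children = value.select { |_, v| v.is_a?(Hash) }
    if children.empty?
      records[name] = scoped                          # a leaf record
    else
      records.merge!(flatten_groups(children, scoped)) # a group: recurse
    end
  end
end

nested = {
  'red_fords' => {
    'make'    => 'ford',
    'color'   => 'red',
    'mustang' => { 'owner' => 'freddy' },
    'taurus'  => { 'owner' => 'freddy' }
  }
}

flat = flatten_groups(nested)
p flat
```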

I really think this improvement should make it to the Rails core – unless they are replacing the fixtures with something radically different.

Dreaming of a Ruby Christmas

This week’s Ruby Quiz is to create something creative and Christmas-related in Ruby. Glen F. Pankow came up with the following solution:

You can check out / download the code from here – which btw prints the following poem:

On the first day of Matzmas my true love gave to me:
   A new version of Ruby!

On the second day of Matzmas my true love gave to me:
   Two string gsubs
   And a new version of Ruby!

On the third day of Matzmas my true love gave to me:
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the fourth day of Matzmas my true love gave to me:
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the fifth day of Matzmas my true love gave to me:
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the sixth day of Matzmas my true love gave to me:
   Six marshals dumping
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the seventh day of Matzmas my true love gave to me:
   Seven ducks a-typing
   Six marshals dumping
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the eighth day of Matzmas my true love gave to me:
   Eight dirs a-globbing
   Seven ducks a-typing
   Six marshals dumping
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the ninth day of Matzmas my true love gave to me:
   Nine ranges stepping
   Eight dirs a-globbing
   Seven ducks a-typing
   Six marshals dumping
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the tenth day of Matzmas my true love gave to me:
   Ten trys a-catching
   Nine ranges stepping
   Eight dirs a-globbing
   Seven ducks a-typing
   Six marshals dumping
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the eleventh day of Matzmas my true love gave to me:
   Eleven ios piping
   Ten trys a-catching
   Nine ranges stepping
   Eight dirs a-globbing
   Seven ducks a-typing
   Six marshals dumping
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

On the twelveth day of Matzmas my true love gave to me:
   Twelve monkeys patching
   Eleven ios piping
   Ten trys a-catching
   Nine ranges stepping
   Eight dirs a-globbing
   Seven ducks a-typing
   Six marshals dumping
   Five Ruby gems!
   Four calling procs
   Three forked threads
   Two string gsubs
   And a new version of Ruby!

Hats off to this solution – I really bow down to Glen’s creativity and technique. The only problem is that I find the bar too high (and the time too short) to submit a solution this time :-) .

Anyway, happy XMas time everyone!

AJAX Scraping with scRUBYt! – LinkedIn, Google Analytics, Yahoo Suggestions

As announced on the scRUBYt! blog, there is a brand new release of scRUBYt! which (among other additions) enables AJAX scraping. I’d like to present a few examples of kicking the data out of non-trivial-to-scrape pages: LinkedIn, Google Analytics and Yahoo (which in itself is not a big deal – unless you want to scrape the suggestions that pop up after entering a keyword into the search text field).

Without further ado, let’s get down to business!

LinkedIn

Let’s say you’d like to scrape your LinkedIn contact list – first name, last name and e-mail of every contact you have. What makes this task complicated (but not for scRUBYt!) is that the contact list is inserted with AJAX after the page is loaded into the browser, and thus it is ‘invisible’ to a standard HTML parser like Hpricot/Nokogiri, so don’t try with those. Instead, check out how you might do it with scRUBYt!:

property_data = Scrubyt::Extractor.define :agent => :firefox do
  fetch          'https://www.linkedin.com/secure/login'
  fill_textfield 'session_key', '*****'
  fill_textfield 'session_password', '*****'
  submit

  click_link_and_wait 'Connections', 5

  vcard "//li[@class='vcard']" do
    first_name  "//span[@class='given-name']"
    second_name "//span[@class='family-name']"
    email       "//a[@class='email']"
  end
end

puts property_data.to_xml

Result: for the above records:

linkedin.png

is the following:

  
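<root>
  <vcard>
    <first_name>Alex</first_name>
    <second_name>Combas</second_name>
    <email>*** alex's email ***</email>
  </vcard>
  <vcard>
    <first_name>Peter</first_name>
    <second_name>Cooper</second_name>
    <email>*** peter's e-mail ***</email>
  </vcard>
  <vcard>
    <first_name>Jim</first_name>
    <second_name>Cropcho</second_name>
    <email>*** jim's e-mail***</email>
  </vcard>
</root>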
  

The magic happens on line 7 (the click_link_and_wait call): you click the ‘Connections’ link and wait 5 seconds, until the list is loaded with AJAX. Then you can scrape the contacts as you normally would.

Frames won’t stop us – Google Analytics

Besides being AJAXy, Google Analytics throws some more complexity into the mix: the login fields are in a frame, which is again not trivial to scrape – fortunately scRUBYt! abstracts all that frame handling away and makes this really easy:

data = Scrubyt::Extractor.define :agent => :firefox do
  fetch 'https://www.google.com/analytics/reporting/login'
  frame :name, "login"

  fill_textfield 'Email', '*****'
  fill_textfield 'Passwd', '*****'
  submit_and_wait 5

  pageviews "//div[@id='PageviewsSummary']//li[@class='item_value']", :example_type => :xpath
end

puts data.to_xml

All you have to do is to ‘go into’ the frame named login. It looks like any navigation step (and basically we can consider it one), after which the scraping is executed on the document in the frame.
We again used an _and_wait method – it takes some time until everything is loaded after logging in.

Scraping an AJAX pop-up

Technically this is not much different from the first scenario, but it’s interesting nevertheless. The task is to scrape the suggestions that Yahoo pops up after you enter something into the search field:


yahoo.png

Here is the scraper:

require 'rubygems'
require 'scrubyt'
require 'cgi'

Scrubyt.logger = Scrubyt::Logger.new


yahoo_data = Scrubyt::Extractor.define :agent => :firefox do
  fetch 'http://www.yahoo.com'
  fill_textfield_and_wait 'p', 'ruby', 5

  suggestion_list "//div[@id='ac_container']//li/a", :example_type => :xpath do
    href "href", :type => :attribute do
      escaped_string /&p=(.+?)$/ do
        suggestion lambda {|x| CGI::unescape(x)}, :type => :script
      end
    end
  end
end

p yahoo_data.to_hash

The result:

[{:suggestion=>"ruby tuesday"}, 
 {:suggestion=>"pokemon ruby"},
 {:suggestion=>"ruby bridges"},
 {:suggestion=>"max and ruby"},
 {:suggestion=>"ruby falls"},
 {:suggestion=>"ruby rippey tourk"}, 
 {:suggestion=>"ruby lane"},
 {:suggestion=>"pokemon ruby cheats"},
 {:suggestion=>"ruby skye"},
 {:suggestion=>"ruby lin"}]
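The two innermost steps of the scraper (the /&p=(.+?)$/ pattern and the CGI::unescape lambda) can be tried in isolation with nothing but the standard library. The href values below are invented for illustration; the real links on the Yahoo page will look different, but the sketch assumes they end in a p= query parameter, as the pattern in the scraper suggests.

```ruby
require 'cgi'

# Hypothetical hrefs, shaped like a search link whose last parameter is p=...
hrefs = [
  'http://search.example.com/search?fr=sketch&p=ruby+tuesday',
  'http://search.example.com/search?fr=sketch&p=max+%26+ruby'
]

suggestions = hrefs.map do |href|
  escaped = href[/&p=(.+?)$/, 1]   # same pattern as in the scraper
  CGI.unescape(escaped)            # "+" becomes " ", "%26" becomes "&"
end

p suggestions
```

This is why the scraper nests a pattern inside the href attribute and a script inside the pattern: each level narrows the previous result a bit further.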

You can download the above (and other) examples from the scRUBYt! examples github repository:

git clone git://github.com/scrubber/scrubyt_examples.git

Please check out scRUBYt’s homepage for more info!

Ruby Quiz – Mix and Match

Solved another Ruby Quiz:


I purchased a number of scented candles recently for sending out to friends and family. While I could be accused of being lazy by getting candles for several people, I’d like to mix up the candles a bit so that each recipient gets a different combination of scents.

Please help me out! Your task is to write a method that randomizes and mixes up the individual candles into groups, one per recipient, in order to minimize group duplication. So, for example:

candles = { :orange   => 3,
            :vanilla  => 2,
            :lavender => 2,
            :garden   => 4 }

recipients = %w(janet nancy susan)

candles_per_recipient = 3

mix_and_match(candles, recipients, candles_per_recipient)

=> { "janet" => [:garden, :lavender, :orange],
     "nancy" => [:garden, :orange, :vanilla],
     "susan" => [:garden, :lavender, :vanilla],
     :extra  => { :orange   => 1,
                  :vanilla  => 0,
                  :lavender => 0,
                  :garden   => 1
                 }
    }


If it is impossible to have a unique combination for every recipient, you should still generate some set of combinations, minimizing repetition of combinations.

If the number of recipients times the number of candles per recipient is more than the supply, generate an error.

Proposed solution: (for the impatient, the source code is here.) In my interpretation, this is a simple combinatorial problem: say the number of recipients is r and candles_per_recipient is c, then you are looking for a (preferably non-repeating) random selection of r elements of c-combinations of the original set of candles. (In fact it’s a bit more complicated than that: the c-combinations have to be recalculated from the remaining candles each time you give away a group of candles, but we’ll get to that.) Sounds confusing? Don’t worry, after the implementation everything will be clear!

So first, define a k-combination for a histogram (a Hash like candles above, where keys are elements and values are cardinalities):

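class Hash
  # Treats the hash as a histogram (element => count) and collects the
  # distinct groups of group_size elements that can be drawn from it.
  def comb(group_size)
    result = []
    inner_comb = lambda do |head,tail|
      tail[0..-(group_size-head.size)].each do |e|
        if (head.size >= group_size-1)
          tail.each {|t| result << head + [t]}
        else
          inner_comb[head + [e], tail[tail.index(e)+1..-1]]
        end
      end
    end
    # Expand the histogram into a flat list first, e.g. {:a => 2} becomes [:a, :a]
    inner_comb[[],self.inject([]) {|a,v| v[1].times{a << v[0]}; a}]
    result.uniq
  end
end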

e.g.:

candles = { :orange   => 2,
            :vanilla  => 1,
            :lavender => 1, 
            :garden => 1 }
            
pp candles.comb(3)

=> [[:lavender, :garden, :orange],
    [:lavender, :garden, :vanilla],
    [:lavender, :orange, :orange],
    [:lavender, :orange, :vanilla],
    [:garden, :orange, :orange],
    [:garden, :orange, :vanilla],
    [:orange, :orange, :vanilla]]

So, for a set of candles, this method generates all possible 3-combinations of the candles. We can then pick one and assign it to one of the recipients. Then we recalculate the above from the remaining candles, give one to the next recipient, and so on and so forth. That's the basic idea, but we also need to ensure the candle combinations are as non-repeating as possible. So let's define some further utility methods:

class Hash
  def remove_set(set)
    set.each {|e| self[e] -= 1}
  end  
end

The above code adjusts the number of candles in the original hash once we give away some of them. So for example:

candles = { :orange   => 2,
            :vanilla  => 1,
            :lavender => 1, 
            :garden => 1 }
            
candles.remove_set([:orange,:orange,:lavender])
p candles
=> {:lavender=>0, :garden=>1, :orange=>0, :vanilla=>1}

and some Array extensions:

class Array 
  def rand
    uniqs = self.select{|e| e.uniq.size == e.size}
    uniqs.empty? ? self[Kernel.rand(length)] : uniqs[Kernel.rand(uniqs.length)]
  end
  
  def unordered_include?(other)
    self.map{|e| e.map{|s| s.to_s}.sort}.include? other.map{|s| s.to_s}.sort
  end  
end

Array#rand tries to pick a random non-repeating combination if there is one (so e.g. [:orange, :lavender, :garden]) or, if there is no such combination, just a random one (e.g. [:orange, :orange, :garden] - orange is repeating, but we have no other choice).

Array#unordered_include? is like normal Array#include?, but disregards the ordering of the elements. So for example:

  [[:lavender, :garden, :orange]].include? [:lavender, :orange, :garden] => false
  [[:lavender, :garden, :orange]].unordered_include? [:lavender, :orange, :garden] => true

Hmm... it would have been much more efficient to use a set here rather than the above CPU-sucker, but I am too lazy to change it now ;-)
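For the curious, the set-based variant could look roughly like the sketch below. It is not part of the submitted solution, and combo_key is a made-up helper name. Note that the set stores sorted arrays rather than sets of candles, because a combination may contain the same scent twice.

```ruby
require 'set'

# Order-insensitive key for a combination. Sorting keeps duplicate
# candles, which a plain Set of symbols would silently drop.
def combo_key(combo)
  combo.map { |scent| scent.to_s }.sort
end

handed_out = Set.new
handed_out << combo_key([:lavender, :garden, :orange])

same_combo      = handed_out.include?(combo_key([:lavender, :orange, :garden]))
different_combo = handed_out.include?(combo_key([:orange, :orange, :garden]))

puts same_combo
puts different_combo
```

The lookup is then a single hash probe instead of a map and sort over every combination handed out so far.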

OK, so finally for the solution:

ERROR_STRING = "The number of recipients times the number of candles per recipient is more than the supply!"

def mix_and_match(candles, recipients, candles_per_recipient)
  return ERROR_STRING if ((candles.values.inject{|a,v| a+v}) < (recipients.size * candles_per_recipient))
  candle_set = recipients.inject({}) do |a,v|
    tried = []
    tries = 0
    loop do
      random_pick = candles.comb(candles_per_recipient).rand
      tried << random_pick unless tried.unordered_include? random_pick
      break unless a.values.unordered_include? random_pick
      break if (tries+=1) > candles.values.size * 2
    end
    candles.remove_set(tried.last)
    a[v] = tried.last
    a  
  end
  candle_set.merge({:extra => candles})
end

So, in the inner loop we randomly pick a candles-per-recipient combination from all the possible combinations; if no one has that combo yet, we assign it to the next recipient. If someone has it already, we try to find a unique combination (loop on), unless it is impossible (checked on line #12). In that case we simply start giving out any combination. Once we give away a set of candles, we remove them from the original set. Easy-peasy.

You can check out the source code here.

This was a great quiz, too bad that not many people took a stab at it (so far 1 except me ;-) ). The hardest part for me was the implementation of the k-combination (and the result looks awful to me - I didn't check any algorithm/pseudocode/other solution though, I wanted to roll my own) - after that the problem was pretty simple. Cheers for the Ruby Quiz guys (== ["Matthew Moss"] I guess?) for this quiz.
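For comparison, the k-combination step could also lean on Ruby's built-in Array#combination (available since Ruby 1.8.7). The sketch below is an alternative, not the code I submitted: histogram_combinations is a hypothetical name, and it sorts each group so that the same candles in a different order count as one combination.

```ruby
# Hypothetical alternative to Hash#comb built on Array#combination.
def histogram_combinations(histogram, group_size)
  # Expand {:orange => 2} into [:orange, :orange] and so on.
  pool = histogram.flat_map { |item, count| [item] * count }
  pool.combination(group_size).map { |combo| combo.sort }.uniq
end

candles = { :orange   => 2,
            :vanilla  => 1,
            :lavender => 1,
            :garden   => 1 }

combos = histogram_combinations(candles, 3)
puts combos.size
```

For the histogram above this yields the same seven groups as Hash#comb, only with the scents inside each group in sorted order.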

Add a powerful AJAX Table to your Rails Application in 5 minutes

I needed to add an AJAX grid/table component with all the bells and whistles (AJAX sorting, pagination, multiple row select, AJAX add/delete etc) to an application I am working on right now (will blog about it when we roll out a good-enough version). We are using jQuery so I started looking for a suitable plugin.

I believe I have found it – it’s called jqGrid and it’s super sexy, feature rich and the documentation is on par with a commercial tool – so I created a Rails plug-in enabling you to add it to your Rails app in no time! Follow the steps below to find out how.

  1. Create a blank Rails app (in your home dir – or change the path in the next step):
    rails -d mysql grid_test
    
  2. install jQuery:
    curl http://jqueryjs.googlecode.com/files/jquery-1.2.6.pack.js > ~/grid_test/public/javascripts/jquery.js
    
  3. install jquery_grid_for_rails – I am using giternal (how to install), so this is what you need to do in this case:
    open config/giternal.yml and enter:

    jquery_grid_for_rails:
      repo: git://github.com/scrubber/jquery_grid_for_rails.git
      path: vendor/plugins
    

    then run

    giternal update
    

    (obviously you can use script/plugin, git submodules, piston, braid or whatever floats your boat)

  4. Generate a migration to test out stuff with:

    script/generate resource person
    
    class CreatePeople < ActiveRecord::Migration
      def self.up
        create_table :people do |t|
          t.string :first_name, :last_name, :title, :i_can_has_cheezburger 
        end
      end
    
      def self.down
        drop_table :people
      end
    end
    
  5. generate dummy data

    script/generate migration create_dummy_people_data_migration
    
    class CreateDummyPeopleDataMigration < ActiveRecord::Migration
      def self.up
        Person.create :first_name =>"He", :last_name => "Man", :title => "Hero", :i_can_has_cheezburger => "Sure"
        Person.create :first_name =>"Bat", :last_name => "Man", :title => "Mr.", :i_can_has_cheezburger => "Yeah"
        Person.create :first_name =>"Cat", :last_name => "Woman", :title => "Ms.", :i_can_has_cheezburger => "Yes"
        Person.create :first_name => "Super", :last_name => "Man", :title => "d00d", :i_can_has_cheezburger => "Nope"
        Person.create :first_name => "Spider", :last_name => "Man", :title => "Mr.", :i_can_has_cheezburger => "Meh"
        Person.create :first_name => "Chuck", :last_name => "Norris", :title => "Sir", :i_can_has_cheezburger => "Who is asking?"
        Person.create :first_name => "G.I.", :last_name => "Joe", :title => "Sgt", :i_can_has_cheezburger => "What is a cheezeburger?"
      end
    
      def self.down
        Person.destroy_all
      end
    end
    
  6. Time to run the migrations!

    Set up the db first:

    rake db:create
    

    run teh migrations:

    rake db:migrate
    
  7. Set up the controller (PeopleController) - add this method:

      def grid_data
        @people = Person.all(:order => "#{params[:sidx]} #{params[:sord]}")
        
        respond_to do |format|
          format.xml { render :partial => 'grid_data.xml.builder', :layout => false }
        end    
      end
    
  8. Modify config/routes.rb to look like

    map.resources :people, :collection => {:grid_data => :get}
    
  9. This goes into your application layout (create a new file - views/layouts/application.html.erb):

    
    	
    		
    		
    		<%= javascript_include_tag 'jquery' %>
    		<%= include_jquery_grid_javascript %>
    		<%= include_jquery_grid_css %>
    	
    	
    		<%= yield %>
    	
    	
    
  10. The last step: views (views/people/_grid_data.xml.builder, views/people/index.html.erb)

    xml.instruct! :xml, :version=>"1.0", :encoding=>"UTF-8"
    xml.rows do
      xml.page params[:page]
      xml.total_pages((@people.size.to_f / params[:rows].to_i).ceil) # round up so a partial last page is counted
      xml.records @people.size
      @people.each do |u|
        xml.row :id => u.id do
          xml.cell u.title
          xml.cell u.first_name
          xml.cell u.last_name
          xml.cell u.i_can_has_cheezburger
        end
      end
    end
    
      <%= jquery_grid :sample, {:url => grid_data_people_url } %>
      
    My cool AJAX grid!
    <%= jquery_grid_table %> <%= jquery_grid_pager %>
  11. That's it! Start script/server, point your browser to http://localhost:3000/people and if you did everything according to the tutorial, you should (hopefully) see something like this:


    ajax_grid.png

    I have uploaded the app to github, be sure to check it out (WARNING - don't copy and paste from the above code, the code highlighting plugin has some problems and you'd get strange results. Clone the repo instead).

    Note that the installed plug-in is included in .gitignore so you have to run "giternal update" after you clone it.

    Drop me a comment if you experience any problems!
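One thing worth tightening before using this beyond a demo: the grid_data action in step 7 interpolates params[:sidx] and params[:sord] straight into the ORDER BY clause, so anything the client sends ends up in the SQL. A possible guard, sketched in plain Ruby so it can be tried without Rails, is to whitelist the column and normalize the direction. The names SORTABLE_COLUMNS, safe_order and total_pages are made up for this sketch and are not part of the plug-in; the column list is taken from the migration in step 4.

```ruby
# Columns the grid is allowed to sort by (from the people table above).
SORTABLE_COLUMNS = %w(first_name last_name title i_can_has_cheezburger)

# Build an ORDER BY fragment from untrusted grid parameters.
def safe_order(sidx, sord, default = 'last_name')
  column    = SORTABLE_COLUMNS.include?(sidx.to_s) ? sidx.to_s : default
  direction = sord.to_s.downcase == 'desc' ? 'DESC' : 'ASC'
  "#{column} #{direction}"
end

# Number of pages needed for record_count rows, rounding up.
def total_pages(record_count, rows_per_page)
  rows = rows_per_page.to_i
  return 0 if rows <= 0          # a missing or zero rows parameter
  (record_count.to_f / rows).ceil
end

puts safe_order('title', 'desc')
puts safe_order('title; DROP TABLE people', 'asc')
puts total_pages(7, 5)
```

In the controller this would be used as Person.all(:order => safe_order(params[:sidx], params[:sord])).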

Thinking About Switching to jQuery?

I was pondering the question too – fortunately not for long! I am in the jQuery camp now and couldn’t be happier about it. Now, I am not saying it is for everyone, nor that it is superior to prototype / script.aculo.us or anything like that (with or without Rails) – all I am saying is that I switched and didn’t regret it! As my earlier research based on Rails Rumble suggested, more and more people are switching to jQuery (as well as companies like Microsoft and Nokia), so why not give it a spin? Here are some articles and stuff I found useful when I was in your shoes:

What’s the Great Idea?!

Coming from Prototype? Who isn’t…

Going Further

Tools

Books

  • jQuery in action – I am reading this book at the moment and it totally kicks ass.

What are you waiting for? Enter the red tower of Hanoi – you won’t be disappointed!

Simple Ideas, Great Usability

Sometimes even the simplest things can bring a big boost in usability. A small sampler I came across recently:

letsfreckle.com: clear text password field – which you can of course turn into a regular one by checking a checkbox in case someone is standing behind your back with the intention of stealing your future freckle password (only God knows what could happen then)! Beware of the (web 2.0) ninjas though, they can hide pretty well – so be sure to turn on the password hiding if you are sitting in a black room!





freckle.png




omgbloglol.com: If you comment on an article, you can check a checkbox stating that you’d like to receive an e-mail if anyone makes a new comment. I’d like to have this everywhere (yeah I know about services like www.cocomment.com and co.mments.com but I found them clumsy compared to the above solution)





email_me_comment.png




news.ycombinator.com: Simplest. Signup Form. Ever.





news_ycombinator.png




It would be great if other sites would realize that you don’t necessarily need your mom’s maiden name and similar info to sign up – it’s perfectly ok to supply that later on.

It’s That Time of the Year Again: Ruby Advent 2008

Lakshan Perera, a Ruby developer from Sri Lanka, started a new project for the holiday season: Ruby Advent Calendar (although it’s not entirely new – you might remember Ruby Inside’s Advent 2006). I am really happy to see this project – I learned a lot about Ruby with the Ruby Inside advent calendar too.

Lakshan is looking for contributors (check it out here) so if you have an interesting Ruby trick (or two) you’d like to share (intended audience: beginners to intermediate developers), contact him via the blog entry. There is a low-traffic (1/day for the next 24-3 days) Twitter feed you can subscribe to.

Happy Ruby Advent 2008 everyone :-)

Hot in Edge Rails: Generate Rails Apps from Templates

Update: Pratik covers rails templates in more detail here.

I hate to use uber-superlatives, but this is just plain frakkin awesome to the power of a Chuck Norris roundhouse-kick: Jeremy McAnally’s rg (don’t worry if you have no clue about rg – I heard about it 10 minutes ago myself (via Pratik)) made it to the Rails core.

So what the hell is rg to begin with? rg is a rails generator, like the ‘rails’ command (the one used to create a blank Rails app) just way cooler. You think it has never been easier to create a skeleton app because we have bort? Well, then check this out:

bort_template.png
(Eyes glazing over? Check it out (as well as other examples) on github).

It’s hard to believe, but the above template is able to generate a clean rails app which is basically bort (Or exactly bort. Or very similar to bort. Or similar to a bort lookalike. Or… I guess I leave this debate up to ruby lawyers, bort experts, rg zealots etc.) Anyway, it doesn’t really matter as rg was not invented to compete with bort – bort is merely used as an example because it’s probably the most popular Rails skeleton app nowadays – rg is far more general than that: an easy, concise, Rubyish way to describe your Rails app (including plugins, vendored gems, lib files, initializers and whatnot) in a very straightforward way.

There are a handful of base Rails apps out there – but the chance that one of them is *exactly* using the tools you are committed to is minimal. For example I prefer HAML to erb, SASS to CSS, jQuery to Prototype, somewhat undecided between Shoulda and rSpec, I’d like to give authlogic a spin the next time I need user authentication etc. and the list goes on and on – and it changes over time. With this addition to Rails this is no problem – it takes minutes to cook your own Rails skeleton app, and minutes to adjust it later.

If this is still not enough, check out Jeremy’s TODO list: Save yourself the hassle of keeping your core components up to date (coming soon). OMG, I have been waiting for something like this for ages, and I doubt I am alone (though the ability to describe gems in environment.rb was a great leap forward) – this will really rock! Kudos Jeremy!

Forget Rock Stars, Gurus, Ninjas and Zen Masters

I thought it’s impossible to significantly enhance this collection, at least with something similarly cyber-h4xx0r-zen-ninja-ish. Well, I am not so sure any more (via linkedin):



code_terrorist.png


Maybe with the Web2.0 bust (?), popular clichés for uber-giga-great developers are fading out too – indeed, who needs an old-fashioned web2.0 ninja (even if he’s a pirate in his free time) if you can hire a Web2.5 code terrorist to…. umm… get some coding done?!?

Update: The article was featured on hacker news and now it’s clear that Giles’s original article can be significantly extended, for example with:

  • Python 3000 Jihadists
  • JavaScript Martyrs
  • Ruby Bionic Commando
  • Rails Black Magicians

and others… check out hacker news if this is still not enough inspiration for you!

Decorating Instance Methods of a Class

I am just working on a brand new release of scRUBYt!, with the intent of bringing AJAX/javascript scraping to the masses (and other great stuff – will announce the release soon).

Scraping JS-heavy pages is not that trivial, among other things because of the asynchronous nature of JavaScript. Quite frequently you click on a link (or perform some other action triggering an AJAX call) which inserts/fills/pops up a div on the page, and you want to navigate/get some data from the new content. However, it’s hard to impossible to determine when the browser has finished displaying the new data – the easiest solution is to wait a few seconds after an AJAX update, until the data is properly loaded.

In practice this means that all scRUBYt! navigation methods (click_link, fill_textfield, check_checkbox, select_item, …) need a decorated version, which waits a given amount of time after executing the navigation step. So for example, given the original method:

def click_link(xpath) ... end

we want a decorated version:

def click_link_and_wait(xpath, seconds)
  click_link xpath #the original method
  sleep seconds if seconds > 0
end

And we want this for each and every method of the NavigationAction class.

Fortunately Ruby makes this really easy! You can decorate the existing methods explicitly upon class creation:

  (instance_methods - Object.instance_methods).each do |old_method|
    define_method "#{old_method}_and_wait" do |*args|
      seconds = args.pop   # the last argument is the wait time
      send old_method, *args ; sleep seconds
    end
  end

or implicitly, at runtime:

alias_method :throw_method_missing, :method_missing

def method_missing(method_name, *args, &block)
  original_method_name = method_name.to_s[/(.+)_and_wait$/,1]
  if original_method_name && (respond_to? original_method_name)
    seconds = args.pop   # the last argument is the wait time
    self.send original_method_name, *args ; sleep seconds
  else
    throw_method_missing(method_name, *args, &block)
  end
end

As you can see, we are executing the decorated method only if it is defined on our class and it ends in _and_wait. In all other cases we simulate the normal method_missing behavior.
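Here is a self-contained sketch of the implicit variant that can be run without scRUBYt!. LazyNavigationSketch is a made-up stand-in for the real navigation class and only logs its calls; the sketch also assumes a Ruby version that has respond_to_missing? (1.9.2 or later), which keeps respond_to? truthful about the generated methods.

```ruby
# Hypothetical stand-in for a navigation class; it just records calls.
class LazyNavigationSketch
  attr_reader :log

  def initialize
    @log = []
  end

  def click_link(text)
    @log << [:click_link, text]
  end

  # Handle foo_and_wait(*args, seconds) for any foo we respond to.
  def method_missing(name, *args, &block)
    original = name.to_s[/\A(.+)_and_wait\z/, 1]
    if original && respond_to?(original)
      seconds = args.pop                      # last argument is the wait time
      result  = send(original, *args, &block)
      sleep(seconds) if seconds > 0
      result
    else
      super                                   # normal NoMethodError behavior
    end
  end

  def respond_to_missing?(name, include_private = false)
    original = name.to_s[/\A(.+)_and_wait\z/, 1]
    (!original.nil? && respond_to?(original, include_private)) || super
  end
end

lazy_nav = LazyNavigationSketch.new
lazy_nav.click_link_and_wait('Connections', 0)
p lazy_nav.log
```

Calling super instead of an aliased method_missing has the same effect here, and an unknown method such as bogus_and_wait still raises NoMethodError.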

Since we know in advance that we want to decorate all the methods of the class, the second way doesn’t make much sense in this case – it’s slower because it has to go through method_missing every time, and it has no advantages. However, the technique is interesting and applicable in other scenarios (e.g. when you don’t know in advance which methods you are going to decorate – for example adding a constraint to a filter in scRUBYt! (filters are also created dynamically at runtime))
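The explicit variant can be tried the same way. NavigationSketch below is again a made-up stand-in that only logs its calls; the point is the define_method loop, which forwards every argument except the last to the original method and treats the last one as the number of seconds to wait.

```ruby
# Hypothetical stand-in for a navigation class; it just records calls.
class NavigationSketch
  attr_reader :log

  def initialize
    @log = []
  end

  def click_link(text)
    @log << [:click_link, text]
  end

  def fill_textfield(name, value)
    @log << [:fill_textfield, name, value]
  end

  # Decorate the public methods defined directly on this class
  # (the :log reader is skipped, it is not a navigation step).
  (public_instance_methods(false) - [:log]).each do |original|
    define_method("#{original}_and_wait") do |*args|
      seconds = args.pop                 # last argument is the wait time
      result  = send(original, *args)
      sleep(seconds) if seconds > 0
      result
    end
  end
end

nav = NavigationSketch.new
nav.click_link_and_wait('Connections', 0)
nav.fill_textfield_and_wait('p', 'ruby', 0)
p nav.log
```

The loop has to come after the method definitions in the class body, since it only sees the methods that exist at the moment it runs.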

Git External Dependency Management with Giternal

I don’t want to reiterate what has already been said on this topic (I ended up in the same boat – tried several tools, didn’t really like any of them and settled for giternal).

Giternal seems to work nicely – after you get the initial roadblock out of the way. According to the README, you should install giternal with

gem install giternal

then use the _giternal_ executable to do various things.

The problem is that after installation there is no _giternal_ executable, because gem installs version 0.0.1, which doesn’t have it. You have to get the sources from git and set it up yourself:

git clone git://github.com/pat-maddox/giternal.git
cd giternal
sudo ruby setup.rb

and you are good to go.

Hope this saves someone a few minutes.

Ruby Quiz: Unit Conversions

I decided to join Ruby Quiz for the first time ever – I realize that I am extremely late to jump on the RQ bandwagon – but hey, better late than never. I picked a funny moment though: it’s a quiz everyone was excited about, but almost nobody solved (2 solutions so far besides mine, which is maybe the lowest number of solvers ever). So here’s the quiz:

Your task is to write a units converter script. The input to the
script must be three arguments: the quantity, the source units, and
the destination units. The first example above would be run like this:

    $ ruby convert.rb 50 miles kilometers

Or, using abbreviations:

    $ ruby convert.rb 50 mi km

Support as many units and categories of units (i.e. volume, length,
weight, etc.) as you can, along with appropriate abbreviations for
each unit.

and my solution:

require 'rubygems'
require 'cgi'
require 'scrubyt'

begin
google_converter = Scrubyt::Extractor.define do
 fetch "http://www.google.com/search?q=#{CGI::escape(ARGV[0])}+#{CGI::escape(ARGV[1])}+to+#{CGI::escape(ARGV[2])}"

 google_result "//td[@dir='ltr']" do
   final_result(/= (.+) /)
 end
end
 puts google_converter.to_hash[0][:final_result]
rescue
 puts "Sorry, even *google* can't translate that!"
end

Examples:

ex:
ruby converter.rb 10 "meter per second" "mile per hour"
22.3693629

ruby converter.rb 10 USD EUR
7.91201836

ruby converter.rb 7 "ruby gems" "python eggs"
Sorry, even *google* can't translate that!
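The only non-scraping logic in the script is the query URL, and that part can be looked at on its own. The sketch below mirrors the fetch line of the solution using just the standard library; conversion_query_url is a hypothetical helper name introduced here for illustration.

```ruby
require 'cgi'

# Build the same search URL the converter fetches: "<quantity> <from> to <to>",
# with each piece CGI-escaped and the pieces joined by "+".
def conversion_query_url(quantity, from, to)
  terms = [quantity, from, 'to', to].map { |term| CGI.escape(term.to_s) }
  "http://www.google.com/search?q=#{terms.join('+')}"
end

url = conversion_query_url(10, 'meter per second', 'mile per hour')
puts url
```

Because every argument is escaped separately, multi-word units such as "meter per second" survive as a single quoted command-line argument.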

The biggest downside of the proposed solution is that you need to be on-line. However, nowadays, when this is almost the norm, the advantages outweigh this in my opinion:

  • Support for a lot of conversions (I doubt anyone can offer a much richer database of units than google) including their abbreviations
  • Up-to-date conversions: for example currency conversion
  • Very robust error handling – outsourced to google! Hard to beat that…