I am currently working on a brand new release of scRUBYt!, with the intent of bringing AJAX/JavaScript scraping to the masses (and other great stuff – will announce the release soon).
Scraping JS-heavy pages is not trivial, among other things because of the asynchronous nature of JavaScript. Quite frequently you click on a link (or perform some other action that triggers an AJAX call) which inserts, fills or pops up a div on the page, and you want to navigate to, or get some data from, the new content. However, it is hard, if not impossible, to determine when the browser has finished displaying the new data. The easiest solution is to wait a few seconds after an AJAX update, until the data is properly loaded.
In practice this means that all scRUBYt! navigation methods (click_link, fill_textfield, check_checkbox, select_item, …) need a decorated version, which waits a given amount of time after executing the navigation step. So for example, given the original method:
def click_link(xpath) ... end
we want a decorated version:
def click_link_and_wait(xpath, seconds)
  click_link xpath # the original method
  sleep seconds if seconds > 0
end
We want this for each and every method of the NavigationAction class.
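Fortunately, Ruby makes this really easy! We can decorate the existing methods explicitly, when the class is created:
(instance_methods - Object.instance_methods).each do |old_method|
  define_method "#{old_method}_and_wait" do |*args|
    seconds = args.pop # the last argument is the wait time
    send old_method, *args # pass the remaining arguments to the original method
    sleep seconds if seconds > 0
  end
end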
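To try the explicit decoration without a browser, here is a minimal, self-contained sketch. `DemoNavigation` and its logging method bodies are hypothetical stand-ins for the real NavigationAction class, not scRUBYt! code. It also uses `instance_methods(false)`, which is an assumption about what you want: it limits the decoration to methods defined directly on the class.

```ruby
# Hypothetical stand-in for NavigationAction: each method only records
# the call, so the example runs without a browser.
class DemoNavigation
  def initialize
    @log = []
  end

  def click_link(xpath)
    @log << [:click_link, xpath]
  end

  def fill_textfield(name, value)
    @log << [:fill_textfield, name, value]
  end

  # Decorate every public method defined so far on this class.
  instance_methods(false).each do |old_method|
    define_method("#{old_method}_and_wait") do |*args|
      seconds = args.pop               # the last argument is the wait time
      result = send(old_method, *args) # forward the remaining arguments
      sleep seconds if seconds > 0
      result
    end
  end

  # Declared after the loop so that the reader itself is not decorated.
  attr_reader :log
end

nav = DemoNavigation.new
nav.click_link_and_wait("//a[@id='more']", 0.01)
nav.fill_textfield_and_wait("query", "ruby", 0)
```

After these two calls, `nav.log` holds both recorded actions in order, and no `log_and_wait` method exists because `attr_reader` was declared after the decoration loop.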
The alternative is to decorate implicitly, at runtime:
alias_method :throw_method_missing, :method_missing
def method_missing(method_name, *args, &block)
  # nil unless the method name ends in _and_wait
  original_method_name = method_name.to_s[/\A(.+)_and_wait\z/, 1]
  if original_method_name && respond_to?(original_method_name)
    seconds = args.pop # the last argument is the wait time
    send original_method_name, *args, &block
    sleep seconds if seconds > 0
  else
    throw_method_missing(method_name, *args, &block)
  end
end
As you can see, we execute the decorated method only if its name ends in _and_wait and the original method is defined on our class. In all other cases we fall back to the normal method_missing behavior.
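The runtime approach can be exercised the same way. The sketch below is a variation rather than the code used in scRUBYt!: `LazyNavigation` is a hypothetical class, it calls `super` instead of an aliased method_missing, and it adds `respond_to_missing?` so that `respond_to?` also reports the decorated names. None of these come from the original snippet.

```ruby
# Hypothetical stand-in for NavigationAction, decorated lazily.
class LazyNavigation
  AND_WAIT = /\A(.+)_and_wait\z/

  attr_reader :log

  def initialize
    @log = []
  end

  def click_link(xpath)
    @log << [:click_link, xpath]
  end

  def method_missing(method_name, *args, &block)
    original = method_name.to_s[AND_WAIT, 1] # nil if the suffix is absent
    if original && respond_to?(original)
      seconds = args.pop                     # the last argument is the wait time
      result = send(original, *args, &block)
      sleep seconds if seconds > 0
      result
    else
      super # unknown name: let Ruby raise the usual NoMethodError
    end
  end

  # Keeps respond_to? consistent with what method_missing accepts.
  def respond_to_missing?(method_name, include_private = false)
    original = method_name.to_s[AND_WAIT, 1]
    (!original.nil? && respond_to?(original, include_private)) || super
  end
end

nav = LazyNavigation.new
nav.click_link_and_wait("//a", 0)
```

Calling a name whose base method does not exist, for example `nav.bogus_and_wait(0)`, still raises NoMethodError, because the `else` branch hands the call back to the default implementation.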
Since we know in advance that we want to decorate all the methods of the class, the second way doesn’t make much sense in this case: it’s slower, because every call has to go through method_missing, and it offers no advantages in return. However, the technique is interesting and applicable in other scenarios, e.g. when you don’t know in advance which methods you are going to decorate. One example is adding a constraint to a filter in scRUBYt! (filters are also created dynamically at runtime).
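If you want to see the dispatch overhead for yourself, a rough measurement with the standard Benchmark module could look like the sketch below. The `Eager` and `Lazy` classes and the iteration count are made up for illustration, and the timings depend on your machine and Ruby version, so treat the result as an indication only. With a wait of several seconds per call, the difference is unlikely to matter in a real scraper.

```ruby
require "benchmark"

# Decorated once, at class creation.
class Eager
  def ping
    :pong
  end

  define_method(:ping_and_wait) do |seconds|
    result = ping
    sleep seconds if seconds > 0
    result
  end
end

# Decorated on every call, through method_missing.
class Lazy
  AND_WAIT = /\A(.+)_and_wait\z/

  def ping
    :pong
  end

  def method_missing(name, *args, &block)
    original = name.to_s[AND_WAIT, 1]
    return super unless original && respond_to?(original)
    seconds = args.pop
    result = send(original, *args, &block)
    sleep seconds if seconds > 0
    result
  end

  def respond_to_missing?(name, include_private = false)
    original = name.to_s[AND_WAIT, 1]
    (!original.nil? && respond_to?(original, include_private)) || super
  end
end

ITERATIONS = 50_000
eager = Eager.new
lazy = Lazy.new

# A wait of 0 skips the sleep, so only the dispatch cost is measured.
eager_time = Benchmark.realtime { ITERATIONS.times { eager.ping_and_wait(0) } }
lazy_time  = Benchmark.realtime { ITERATIONS.times { lazy.ping_and_wait(0) } }

puts format("define_method: %.4fs, method_missing: %.4fs", eager_time, lazy_time)
```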
You might want to check out how the clickAndWait methods are implemented in Selenium. We’ve also got some custom code kicking around somewhere to wait for all AJAX requests to finish; there’s a method available (name I can’t recall) which tells you how many waiting/active connections there are.