Cool JS Shadow Library

Posted on April 25, 2008 by peter

Light & Shadow is a Prototype-based library for creating great-looking drop shadows easily. Check out the HTML I used to generate the example image and see for yourself that it’s not rocket science!

All you have to do is set up a light source with a few parameters (distance, intensity, size etc.) and add the class ‘shadowThrowing’ to the elements which should… well, throw shadows :-). I won’t go into details here; you can find the explanation and other details on the Light & Shadow project page.

(Found via Gedankenkonserve – thanks Bernhard ;-))

Needle in the Haystack – Information Overloading 2.0

Posted on April 19, 2007 by peter

Do you also have the feeling that you are totally drowning in the unbelievable amount of information emitted by the Web today? (And by other media as well, which amplifies this greatly, but I would like to focus solely on the Web aspect in this article.) I feel more and more frustrated day by day, trying to stay on top of my ever-growing heap of unopened e-mails, unread blog entries, unchecked news sites etc., with a constant fear that although I spend a fair amount of time consuming and processing all the information pouring in, I am still missing something very important all the time.

The “problem” is that there are way too many outstanding blogs, aggregators, social news sites, popular links on bookmarking services and other sources of information which you “just can not miss”. I fear I am definitely losing the battle – there are more and more information sources, but no new, more effective methods (at least none that I know of) to handle them, so I guess it’s pretty clear that as time goes on, more and more info will fall through the cracks (or more and more time will be needed to prevent this).

Since there is no way to stop the exponential growth of information (and if there were, I doubt anybody would want to use it – this is just not the way this problem should be approached), we have to influence the other factor: find more effective means of locating, sorting, bookmarking, processing and otherwise handling the most important data.

It is interesting to observe that at the moment, services with this intention are not really receiving as much attention as they should – provided that the above reasoning is sound and thus there is a need for more effective handling of existing information. Google is a trivial example of this: it has loads of interesting tricks to refine, specify and narrow your search (for example the synonym operator, ~, or other advanced options) – yet I bet 2 hours of my most precious blog-reading time that most of us can not even tell when we last used advanced search (besides a few trivial operators entered into the search box, like site:.rubyrailways.com). In most cases I never check out more than 2-3 result pages (and just the first page 90% of the time) – which is interesting, given that I am doing 99% of my searches on google!
In my opinion, exactly the opposite is true: Sites like twitter or tumblelog are immensely popular, flooding you with even more and more information, all the time, every minute, as fast as possible etc. You did not have enough time to read blogs? No problem, here are tumblelogs and twitter messages, which will help you by shooting even more data right into your face much more frequently than ever. Welcome to information overloading 2.0.

Fortunately there is hope on the horizon: some sites are striving to help the situation by providing an interesting view of the data, narrowing down the information to a specific niche, or aggregating and presenting it in a way so that you do not have to hand-pick it from an enormous set of everything-in-one-bag infosoup. I will try to describe a few of them which I have found interesting recently.

Tools utilizing visual representation of data – People are visual beings. In most cases, a few good, to-the-point pictures or diagrams can tell much more than boring pages of text. Therefore it is quite intuitive that a visual representation of data (typically the result of search engine queries) could help to navigate, refine and finally locate relevant results better than text-only pages.

My current favorite in this category is quintura. Besides working as a normal yahoo search, quintura does a lot of other interesting things: it finds related tags to your query and displays them as a tag cloud. You can further refine the search results or navigate to documents found by any of the related tags. Hovering over a related tag displays the tags related to that tag. For example, searching for web scraping and hovering over the ‘ruby’ related tag also displays ‘scrubyt’ – it would definitely take more time to find scrubyt on google, even by using the search term combination ‘web scraping ruby’ – so the functionality offers more than just a fancy view: it actually makes searching faster and more effective.

Am I using quintura regularly? Nope. Given that I stated just a few sentences ago that it makes searching faster and more effective, this is strange – but for some reason, if I am after something, I try to find it on google.com. This is rather irrational, don’t you think?
Sites concentrating on a specific niche – I feel that (otherwise great) sites like digg are just too overcrowded for me: with over 5000 submissions a day in a lot of diverse categories it simply takes too much time to read even just the front page stories. I am mainly interested in technology and development related articles – and while a lot of digg proponents argue that there are both technology and programming categories on digg, they are still too ‘mainstream’ for my taste and rarely cater to a hardcore developer/hacker in my opinion.
Fortunately dzone and tektag are here to rescue the situation!

The guys over at dzone are really cranking all the time to bring a great digg-like site for developers that helps you to stay on top of the current development and technology trends. The community (which is crucial in the case of such a site of course) is really nice and helpful, and in my opinion the site owners have found (and are constantly fine-tuning) the right magic formula to keep the site from being overloaded with redundant information but still delivering the most relevant news and stuff. Currently, dzone is my no. 1 source of developer and tech news on the web.

In my opinion, tektag has not reached the maturity level of dzone yet (I think they are currently in public beta), but once this happens, I bet it will be a very important and relevant source of information for developers, too. To put it simply, tektag is to del.icio.us what dzone is to digg. Why is this so great? If you need to bookmark something, you should just use del.icio.us, right? Wrong – at least if you intend to use del.icio.us in any other way than to store your personal bookmarks. The problem with del.icio.us again is that people use it to bookmark just anything – therefore it is virtually impossible to track the movers and shakers in a narrow topic (like programming). Visiting del.icio.us/popular will show you what’s being bookmarked the most overall, not inside your category of interest (of course I know there are categories like del.icio.us/popular/programming, but these still do not solve the situation fully by far).
Tektag has the potential to solve this situation by adding development-specific features and tweaks, but most importantly by the fact that only developer articles will be saved here, and thus interpreting the data will be much easier since the input won’t be cluttered with an enormous amount of information from arbitrary topics. In my opinion the only question regarding their success is: can they build the critical user mass?
Semantic search – if you hear the term ‘search engine’, most probably google or one of its competitors (yahoo, msn) springs to your mind, and you are right – for the absolute majority of searches, we are using these sites. However, they are not really that powerful in letting you express what exactly you are searching for (and of course, stemming from this fact, in actually bringing you the right results) because they are not trying to understand the documents on the Web: they just crawl and index them to be searchable by the phrases they contain.
Since the above mentioned sites are still the absolute market leaders in search, it’s clear that keyword based indexing is still good enough(tm) – until somebody shows that there is a more sophisticated way of searching, by applying natural language processing, ontology extraction and other semantic techniques to actually understand the documents, and delivers usable results with these techniques.

Spock, an upcoming people search engine, is a great example of this principle in action. Spock’s goal is to crawl the whole web and extract information about people – which is far from trivial – since to do this properly, their spiders have to be smart enough to understand human language as much as possible (a simple example: think of a birth date, e.g. 07/06/05 – is 07 denoting a day (meaning the 7th day in the month) or a year (the year 2007)? There are hundreds, maybe thousands of date formats used on the Web – and there are far more complicated problems to solve than this).
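To make the date ambiguity concrete, here is a minimal Ruby sketch. The three candidate formats and the helper name possible_dates are my own assumptions for illustration (this is not how spock works); it simply shows that the same string parses to three different valid dates.

```ruby
require 'date'

# Hypothetical short list of candidate formats; real pages use many more.
CANDIDATE_FORMATS = {
  'month/day/year' => '%m/%d/%y',
  'day/month/year' => '%d/%m/%y',
  'year/month/day' => '%y/%m/%d'
}

# Returns every reading of the text that is a valid calendar date.
def possible_dates(text)
  CANDIDATE_FORMATS.each_with_object({}) do |(label, format), readings|
    begin
      readings[label] = Date.strptime(text, format)
    rescue ArgumentError
      # The text is not a valid date under this format; skip it.
    end
  end
end

readings = possible_dates('07/06/05')
readings.each { |label, date| puts "#{label}: #{date}" }
```

All three formats accept the string, so a crawler needs extra context (page language, neighbouring dates, plausibility of a birth year) to pick one.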
OK, solving complex problems or not, what’s so cool about a people search engine? After all, you can use ye good olde google as for everything else. Tim O’Reilly has an excellent example against this approach: on google, it’s trivial to find Eric Schmidt, google’s CEO – however it’s much harder to find the other 44 Eric Schmidts returned by spock. It’s not that google does not find them – but actually locating them among approximately 4,500,000 returned documents (as opposed to spock’s 45) is nearly impossible.
Spock is probably the best example in this article to demonstrate how a service should bring you all the information you need – and not even a bit more!

If these services are so promising and they help you to fight information overloading, thus helping you to find the desired information more easily (so that you will have more time to read other blogs :-)), why are they orders of magnitude less popular than the ones flooding you all the time? Why do people not use things as simple as advanced google search to fight information overloading? Is information overloading a bad thing at all (since it seems the sites generating the most information at the fastest pace are the most popular)? I can’t really answer these questions at the moment, but even if I could, I have to run now to read some interesting (tumble|b)logs. What!? 20 twitter messages received? Ok, seriously gotta go now…

Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1

Posted on February 4, 2007 by peter

This article is a follow-up to the quite popular first part on web scraping – well, sort of. The relation is closer to that between Star Wars I and IV – i.e., in chronological order, the 4th comes first. To continue the analogy, I am probably in the same shoes as George Lucas was after creating the original trilogy: the series became immensely popular and there was demand for more – in both quantity and depth.

After I realized – not exclusively, but also – through the success of the first article that there is a need for this sort of stuff, I began to work on the second part. As stated at the end of the previous installment, I wanted to create a demo web scraping application to show some advanced concepts. However, I left out a major coefficient from my future-plan-equation: the power of Ruby.

Basically this web scraping code was my first serious Ruby program: I had come to know Ruby just a few weeks earlier, and I decided to try it out on some real-life problem. After hacking on this app for a few weeks, suddenly a reusable web scraping toolkit – scRUBYt! – began to materialize, which caused a total change of plan: instead of writing a follow-up, I decided to finish the toolkit, sketch a big picture of the topic, place scRUBYt! inside this frame and use it to illustrate the theory described here.

The Big Picture: Web Information Acquisition

The whole art of systematically getting information from the Web is called ‘Web information acquisition’ in the literature. The process consists of 4 parts (see the illustration), which are executed in this order: Information Retrieval (IR), Information Extraction (IE), Information Integration (II) and Information Delivery (ID).

Information Retrieval

Navigate to and download the input documents which are the subject of the next steps. This is probably the most intuitive step to make – clearly, the information acquisition system has to be pointed to the document which contains the data first, before it can perform the actual extraction.

The absolute majority of the information on the Web resides in the so-called deep web – backend databases and different legacy data stores which are not contained in static web documents. This data is accessible via interaction with web pages (which serve as a frontend to these databases) – by filling and submitting forms, clicking links, stepping through wizards etc. A typical example could be an airport web page: an airport has all the schedules of the flights it offers in its databases, yet you can access this information only on the fly by submitting a form containing your concrete request.

The opposite of the deep web is the surface web – static pages with a ‘constant’ URL, like the very page you are reading. In such a case, the information retrieval step consists of just downloading the URL. Not a really tough task.

However, as I said two paragraphs earlier, most of the information is stored in the deep web – different actions, like filling input fields, setting checkboxes and radio buttons, clicking links etc. are needed to get to the actual page of interest which can be then downloaded as the result of navigation.

Besides the fact that this is not trivial to do automatically from a programming language simply because of the nature of the task, there are a lot of pitfalls along the way, stemming from the fact that the HTTP protocol is stateless: the information provided to a request is lost when making the next request. To remedy this problem, sessions, cookies, authorizations, navigation history and other mechanisms were introduced – so a decent information retrieval module has to take care of these as well.
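To illustrate the bookkeeping that statelessness forces on a retrieval module, here is a minimal sketch of a cookie jar in plain Ruby. The class name TinyCookieJar and the sample header values are hypothetical, and the sketch deliberately ignores attributes such as Path, Domain and Expires that a real client has to honour.

```ruby
# A deliberately tiny cookie jar: it remembers name=value pairs taken from
# Set-Cookie response headers and replays them in a Cookie request header.
class TinyCookieJar
  def initialize
    @cookies = {}
  end

  # Store the name=value part of one Set-Cookie header value,
  # dropping attributes such as Path or HttpOnly.
  def store(set_cookie_value)
    pair = set_cookie_value.split(';').first
    name, value = pair.split('=', 2)
    @cookies[name.strip] = value.to_s.strip
  end

  # Build the value of the Cookie header for the next request.
  def header
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

jar = TinyCookieJar.new
jar.store('session_id=abc123; Path=/; HttpOnly')
jar.store('lang=en; Path=/')
puts jar.header
```

Without something like this, every request would arrive at the server as if it came from a brand new visitor, and a multi-step navigation (log in, search, click) would fall apart.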

Fortunately, in Ruby there are packages which offer exactly this functionality. Probably the most well-known is WWW::Mechanize, which is able to automatically navigate through Web pages as a result of interaction (filling forms etc.) while keeping cookies, automatically following redirects and simulating everything else that a real user (or the browser in response to that user) would do. Mechanize is awesome, but from my perspective it has one major flaw: you can not interact with JavaScript websites. Hopefully this feature will be added soon.

Until that happy day, if someone wants to navigate through JS powered pages, there is a solution: (Fire)Watir. Watir is capable of doing similar things to Mechanize (I never did a head-to-head comparison, though it would be interesting) with the added benefit of JavaScript handling.

scRUBYt! comes with a navigation module, which is built upon Mechanize. In future releases I am planning to add FireWatir, too (just because of the JavaScript issue). scRUBYt! is basically a DSL for web scraping with a lot of heavy lifting behind the scenes. Though the real power lies in the extraction module, there are some goodies in the navigation module, too. Let’s see an example!

Goal: Go to amazon.com. Type ‘Ruby’ into the search text field. To narrow down the results, click ‘Books’, then for further narrowing ‘Computers & Internet’ in the left sidebar.

Realization:

  fetch           'http://www.amazon.com/'
  fill_textfield  'field-keywords', 'ruby'
  submit
  click_link      'Books'
  click_link      'Computers & Internet'

Result: This document.

As you can see, scRUBYt’s DSL hides all the implementation details, making the description of the navigation as easy as possible. The result of the above few lines is a document – which is automatically fed into the scraping module, but this is already the topic of the next section.
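For readers wondering how such a DSL can work in Ruby at all, here is a minimal sketch. It is not scRUBYt!’s actual implementation: the class name NavigationRecorder is hypothetical, and instead of driving Mechanize it only records the steps, which is enough to show the instance_eval trick that lets bare calls like fetch and click_link appear inside a block.

```ruby
# Records navigation steps written in a scRUBYt!-like syntax.
# A real implementation would execute each step against a browser
# or an HTTP client instead of just collecting it.
class NavigationRecorder
  attr_reader :steps

  def initialize(&script)
    @steps = []
    # Evaluate the block with self set to this object, so that bare
    # method calls inside the block resolve to the methods below.
    instance_eval(&script)
  end

  def fetch(url)
    @steps << [:fetch, url]
  end

  def fill_textfield(name, value)
    @steps << [:fill_textfield, name, value]
  end

  def submit
    @steps << [:submit]
  end

  def click_link(text)
    @steps << [:click_link, text]
  end
end

recorder = NavigationRecorder.new do
  fetch          'http://www.amazon.com/'
  fill_textfield 'field-keywords', 'ruby'
  submit
  click_link     'Books'
  click_link     'Computers & Internet'
end

recorder.steps.each { |step| p step }
```

Keeping the script as a list of steps separates describing the navigation from performing it, so the same description could later be replayed by a different backend.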

Information Extraction

I think there is no need to write about why one needs to extract information from the Web today – the ‘how’ is a much more interesting question.

Why is Web extraction such a tedious task? Because the data of interest is stored in HTML documents (after navigating to them, that is), mixed with other stuff like formatting elements, scripts or comments. Because the data is missing any semantic description, a machine has no idea what a web shop record is or what a news article might look like – it just perceives the whole document as a soup of tags and text.

Querying objects in systems which are formally defined and thus understandable for a machine is easy. For instance, if I want to get the first element of an array in Ruby, I can do it easily like this:

my_array.first

Another example for a machine-queryable structure could be an SQL table: to pull out the elements matching the given criteria, all that needs to be done is to execute an SQL query like this:

SELECT name FROM students WHERE age > 25

Now, try to do similar queries for a Web page. For example, suppose that you already navigated to an ebay page by searching for the term ‘Notebook’. Say you would like to execute the following query: ‘give me all the records with price lower than $400’ (and get the results into a data structure of course – not rendered inside your browser, since that works naturally without any problems).

The query was definitely an easy one, yet without implementing a custom script extracting the needed information and saving it to a data structure (or using stuff like scRUBYt! – which does exactly this instead of you) you have no chance to get this information from the source code.
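By contrast, once the records are sitting in a data structure, the query really is a one-liner. The sketch below assumes the scraping has already happened; the Record struct and the three sample records are made up for illustration and do not come from a real ebay page.

```ruby
# Hypothetical records, as if they had already been scraped from the page.
Record = Struct.new(:title, :price)

records = [
  Record.new('Notebook A', 349.00),
  Record.new('Notebook B', 899.00),
  Record.new('Notebook C', 399.99)
]

# 'Give me all the records with price lower than $400'
cheap = records.select { |record| record.price < 400 }

cheap.each { |record| puts "#{record.title}: $#{'%.2f' % record.price}" }
```

All of the difficulty lies in getting from the HTML source to the records array, not in the query itself.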

There are ongoing efforts to change this situation – most notably the semantic Web, common ontologies, different Web2.0 technologies like taxonomies, folksonomies, microformats or tagging. The goal of these techniques is to make the documents understandable for machines to eliminate the problems stated above. While there are some promising results in this area already, there is a long way to go until the whole Web will be such a friendly place – my guess is that this will happen around Web88.0 in the optimistic case.

However, at the moment we are only at version 2.0 (at most), so if we would like to scrape a web page for whatever reason *today*, we need to cope with the difficulties we are facing. I wrote an overview on how to do this with the tools available in Ruby (update: there is a new kid on the block – HPricot – which is not mentioned there).

The rough idea of those packages is to parse the Web page source into some meaningful structure (usually a tree) then provide a querying mechanism (like XPaths, CSS selectors or some other tree navigation model). You could think now: ‘A-ha! So actually a web page *can* be turned into something meaningful for machines, and there *is* a formal model to query this structure – so where is the problem described in the previous paragraphs? You just write queries like you would in a case of a database, evaluate them against the tree or whatever and you are done’.
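The parse-then-query idea can be sketched with REXML, the XML parser that ships with Ruby. This is only an illustration under friendly assumptions: the tiny page below is made up and well-formed, whereas real-world HTML usually is not and needs a more forgiving parser.

```ruby
require 'rexml/document'

# A made-up, well-formed page standing in for a real web shop listing.
PAGE = <<~'HTML'
  <html>
    <body>
      <table>
        <tr><td>Camera A</td><td>$299.00</td></tr>
        <tr><td>Camera B</td><td>$1,249.00</td></tr>
      </table>
    </body>
  </html>
HTML

# Step 1: parse the source into a tree.
doc = REXML::Document.new(PAGE)

# Step 2: query the tree. First select every row, then pick cells
# relative to each row.
rows = REXML::XPath.match(doc, '/html/body/table/tr')
records = rows.map do |row|
  {
    name:  REXML::XPath.first(row, 'td[1]').text,
    price: REXML::XPath.first(row, 'td[2]').text
  }
end

records.each { |record| puts "#{record[:name]}: #{record[:price]}" }
```

Note that the queries only mention tags and positions; nothing in them says ‘camera’ or ‘price’, which is exactly the gap discussed next.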

The problem is that the machine’s understanding of the page and human thinking about querying this information are entirely different, and there is no formal model (yet) to eliminate this discrepancy. Humans want to scrape ‘webshop records with Canon cameras with maximal price $1000’, while the machine sees this as ‘the third <td> tag inside the eighth <tr> tag inside the fifth <table> … (lot of other tags) inside the <body> tag inside the <html> tag, where the text of the seventh <td> tag contains the string ‘Canon’ and the text of the ninth <td> is not bigger than 1000’ (to even get the value 1000 you have to use a regular expression or something similar to get rid of the currency symbol that is most probably present, and other possible additional information).
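The price clean-up mentioned in the parenthesis can be sketched as follows. The sample rows and the helper name parse_price are hypothetical, and the regular expression assumes US-style numbers (comma as thousands separator, dot as decimal point); a format like 1.249,00 would need different handling.

```ruby
# Hypothetical cell texts, as they might come out of the <td> tags.
ROWS = [
  ['Canon Camera A', '$299.00'],
  ['Canon Camera B', 'USD 1,249.00'],
  ['Nikon Camera C', '$149.99']
]

# Pull the first number out of a price cell, ignoring currency symbols
# and any surrounding text. Returns nil when there is no number at all.
def parse_price(text)
  number = text[/\d[\d,]*(?:\.\d+)?/]
  number && number.delete(',').to_f
end

# 'Canon cameras with maximal price $1000'
matches = ROWS.select do |name, price_text|
  price = parse_price(price_text)
  name.include?('Canon') && price && price <= 1000
end

matches.each { |name, price_text| puts "#{name} (#{price_text})" }
```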

So why is this so easy with a database? Because the data stored in there has a formal model (specified by the CREATE TABLE statement). Both you and the computer know *exactly* what a Student or a Camera looks like, and both of you are speaking the same language (most probably an SQL dialect).

This is totally different in the case of a Web page. A web shop record, a camera detail page or a news item can look just anyhow, and your only chance to find out how it looks on the concrete Web page of interest is to exploit its structure. This is a very tedious task on its own (as I have said earlier, a Web page is a mess of real data, formatting, scripts, stylesheet information…). Moreover there are further problems: for example, web shop records need not be uniform even inside the same page – certain records can miss some cells which others have, or may keep part of their information on a detail page while others do not – so in some cases, identifying a data model is impossible or very complicated – and I did not even talk about scraping the records yet!

So what could be the solution?

Intuitively, there is a need for an interpreter which understands the human query and translates it to XPaths (or any querying mechanism a machine understands). This is more or less what scRUBYt! does. Let me explain how – this is easiest through a concrete example.

Suppose you would like to monitor stock information on finance.yahoo.com! This is how I would do it with scRUBYt!:

  # Navigate to the page
  fetch 'http://finance.yahoo.com/'

  # Grab the data!
  stockinfo do
    symbol 'Dow'
    value  '31.16'
  end

output:

  <root>
    <stockinfo>
      <symbol>Dow</symbol>
      <value>31.16</value>
    </stockinfo>
    <stockinfo>
      <symbol>Nasdaq</symbol>
      <value>4.95</value>
    </stockinfo>
    <stockinfo>
      <symbol>S&amp;P 500</symbol>
      <value>2.89</value>
    </stockinfo>
    <stockinfo>
      <symbol>10-Yr Bond</symbol>
      <value>0.0100</value>
    </stockinfo>
  </root>

Explanation: I think the navigation step does not require any further explanation – we fetched the page of interest and fed it into the scraping module.

The scraping part is more interesting at the moment. Two things happened here: we have defined a hierarchical structure of the output data (like we would define an object – we are scraping StockInfos which have Symbol and Value fields, or children), and showed scRUBYt! what to look for on the page in order to fill the defined structure with relevant data.

How did I know I had to specify ‘Dow’ and ‘31.16’ to get these nice results? Well, by manually pointing my browser to ‘http://finance.yahoo.com/’, observing an example of the stuff I wanted to scrape, and leaving the rest to scRUBYt!. What actually happens under the hood is that scRUBYt! finds the XPath of these examples, figures out how to extract the similar ones and arranges the data nicely into a result XML (well, there is much more going on, but this is the rough idea). If anyone is interested, I can explain this in a further post.
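The rough idea of learning from examples can be sketched in a few lines of Ruby with REXML. To be clear, this is my simplified illustration and not scRUBYt!’s real algorithm: the page is made up, the helper names are hypothetical, and the sketch assumes that both examples sit directly inside the same record element.

```ruby
require 'rexml/document'

# A made-up, well-formed stand-in for the real page.
PAGE = <<~'HTML'
  <html><body><table>
    <tr><td>Dow</td><td>31.16</td></tr>
    <tr><td>Nasdaq</td><td>4.95</td></tr>
    <tr><td>10-Yr Bond</td><td>0.0100</td></tr>
  </table></body></html>
HTML

# Find the first element whose own text equals the example.
def find_by_text(doc, text)
  REXML::XPath.match(doc, '//*').find { |element| element.text == text }
end

# Path of tag names from the root, without indices. Dropping the
# indices is what generalizes one example to all similar records.
def tag_path(element)
  names = []
  node = element
  until node.is_a?(REXML::Document)
    names.unshift(node.name)
    node = node.parent
  end
  '/' + names.join('/')
end

# 1-based position of an element among its parent's child elements.
def child_position(element)
  element.parent.elements.to_a.index { |sibling| sibling.equal?(element) } + 1
end

doc = REXML::Document.new(PAGE)
symbol_example = find_by_text(doc, 'Dow')
value_example  = find_by_text(doc, '31.16')

record_path = tag_path(symbol_example.parent)
symbol_pos  = child_position(symbol_example)
value_pos   = child_position(value_example)

# Apply the learned pattern to every record on the page.
stockinfo = REXML::XPath.match(doc, record_path).map do |row|
  cells = row.elements.to_a
  { symbol: cells[symbol_pos - 1].text, value: cells[value_pos - 1].text }
end

puts record_path
stockinfo.each { |info| puts "#{info[:symbol]}: #{info[:value]}" }
```

The learned pattern (a record path plus a relative position per field) no longer contains the strings ‘Dow’ or ‘31.16’, so it keeps working when the values on the page change.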

You could think now ‘O.K., this is very nice and all, but you have been talking about *monitoring* and I don’t really see how – the value 31.16 will change sooner or later and then you have to go to the page and re-specify the example again – I would not call this monitoring’.

Great observation. It’s true that scRUBYt! would not be of much use if changing examples were not handled (unless you would like to get the data only once, that is) – fortunately, the situation is dealt with in a powerful way!

Once you run the extractor and you think the data it scrapes is correct, you can export it. Let’s see what the exported finance.yahoo.com extractor looks like:

  # Navigate to the page
  fetch 'http://finance.yahoo.com/'

  # Construct the wrapper
  stockinfo "/html/body/div/div/div/div/div/div/table/tbody/tr" do
    symbol "/td[1]/a[1]"
    value  "/td[3]/span[1]/b[1]"
  end

As you can see, there are no concrete examples any more – the system generalized the information and now you can use this extractor to scrape the information automatically whenever you like – until the moment the guys at yahoo change the structure of the page – which is fortunately not happening every other day. In that case the extractor should be regenerated with up-to-date examples (in the future I am planning to add automatic regeneration in such cases) and the fun can begin from the start once again.

This example just scratched the surface of what scRUBYt! is capable of – there is tons of advanced stuff to fine-tune the scraping process and get the data you need. If you are interested, check out http://scrubyt.org for more information!

Conclusion

The first two steps of information acquisition (retrieval and extraction) are dealing with the question ‘How to get the data I am interested in (querying)’. Up to the present version (0.2.0) scRUBYt! contains just these two steps – however, to do even these properly, I will need a lot of testing, feedback, bug fixing, stabilization, adding heaps of new features and enhancements – because as you have seen, web scraping is not a straightforward thing to do at all.

The last two steps (integration and delivery) address the question ‘what to do with the data once it is collected, and how to do that (orchestration)’. These facets will be covered in a future installment – most probably when scRUBYt! contains these features as well.

If you liked this article and you are interested in web scraping in practice, be sure to install scRUBYt! and check out the community page for further instructions – the site is just taking off, so there is not too much yet – but hopefully enough to get you started. I am counting on your feedback, suggestions, bug reports, extractors you have created etc. to enhance both scrubyt.org and scRUBYt! user experience in general. Be sure to share your experience and opinion!

Mind-boggling blogging

Posted on September 29, 2006 by peter

Tagline: Blogging is a very easy looking activity, until you _actually_ begin with it…

Most probably even the irregular readers of rubyrailways have noticed a 3 month period of silence during the summer, which ended just a few days ago. In my opinion it is generally not a very good idea to temporarily abandon a blog without even announcing a summer holiday or posting a note like “to be continued after an undefined period of blogger’s block” or something. Why did I allow it to happen then?

Well, there are a handful of reasons for this: summer holidays, tough days at work, a lot of stuff to do on my PhD, but mainly a kind of blogger’s crisis. Although all the reasons are very interesting, I would like to elaborate on the last one a bit.

The first problem stems from the relative success of my previous entries: Tutorials like Install Internet Explorer on Ubuntu Dapper in 3 easy steps, Data extraction for Web 2.0: Screen scraping in Ruby/Rails or Getting Ruby on Rails up and running on Ubuntu Dapper were quite popular and set a standard which was not easy to top (or at least to maintain) in terms of equally interesting topics.
Unfortunately I can pursue Ruby, Rails and even screen scraping/web extraction only in my spare time which is a scarce resource (it’s kind of hard to work full time, roll a PhD and blog simultaneously :-)) and therefore I do not bump into an interesting topic just every second day. However, this eventually got me into a kind-of an inverse Concorde-effect: If I have waited a week, then I can wait another to deliver something sexy. After a month: Now that I have waited a month, I surely have to come up with something *really* juicy… You get the idea.

I believe I am not the only one around with this thinking pattern, and I am not sure how others are handling this problem, but I have decided to give up this habit – in the future I would like to blog regularly, even at the cost that not every post will be a top-notch blockbuster :-).

The second problem is that I am kind of a renaissance guy: I am interested in new technologies, programming, science research, economics, reading books about just about everything, photography, traveling, computer games, sports…
However, since rubyrailways is my first attempt at blogging, I am quite unsure how to deal with this amount of information: what should be the ratio of not-necessarily-correlated topics (e.g. Ruby, travelling and PhD research). I am nearly sure though that it is not a good idea to blog about everything, since then every post will be uninteresting for most of the readers.

Yes, I know that categories were invented to work around this problem. However, in my experience most people today are using feed aggregators and/or personal start pages like bloglines, netvibes or pageflakes, and hence are facing this problem nevertheless. Yes, they could ignore the posts that are not interesting to them, but after doing so a few times they will potentially ignore your whole blog.
So how to find the golden mean?

A possible solution is to have a separate blog for everything: in my case this would mean at least a software development (mainly Ruby/Rails), a general technology, a Linux/Ubuntu, a Science/PhD research and a travelling blog. Well, I certainly would not have the time to keep up all of them since I am struggling even with rubyrailways :-)… I could of course ignore what people think about my blog and just write it for myself, but that would deprive me of knowing what other people think about the things I am after, which is very valuable information for me.

I would be very much interested in your opinion on this topic: How do you solve this ‘feature creep’ on your blog – by maintaining more blogs, focusing on just one topic and ignoring the others, or trying to balance somehow?

Please leave me a comment or send me a mail, I’d really like to hear your opinion…

Analyze this

Posted on June 16, 2006 by peter

Finally… After several months, my google analytics invitation has arrived.
Does it offer more than any ‘usual’ page statistics tool that can be found on the net?
Short answer: absolutely! For the detailed analysis of analytics read on…

My site is hosted at dreamhost, and they offer a pre-installed
logfile analyser, analog, which claims to be ‘The most popular logfile
analyser in the world’. It has a decent feature set (not too much graphical fancy stuff, but
nice analysis nevertheless); still, I wanted to give something different a try, too – so I installed statcounter, ‘A free yet reliable invisible web tracker, highly
configurable hit counter and real-time detailed web stats’.

I had been quite satisfied with both statistics tools (although in the free version of statcounter, the log size
is limited to 100 hits) – until I saw what google analytics is capable of.

The number of features that google analytics has to offer is HUGE. I have been using it
for a week now, and there are still some statistics which I simply have not had time to look at. There are quick overview screens for everything important
(above you can see one of them) – and if you would like to drill down to every single hit, you can do that, too.

Ever wanted to know everything about your visitors? No problem. You can view every single visitor’s referral link, which country/city they came from (also displayed on the world map), connection speed, platform, browser, screen resolution (even color depth!), language, which keywords led them to you, their loyalty, conversion rate (I have listed just a small fraction of the features)… and all this presented with nice graphs, charts etc. Simply unbelievable.

I will not write anything more about this tool: if you have it, you know what I am talking about, and if not, go and get it if you are interested in your web site stats!

My advice is: forget about ANY kind of stat counter, and request a Google Analytics account ASAP.

Announcing screen-scraping series

Posted on June 14, 2006 by peter

I am planning to write a series of entries on screen scraping, automated
Web navigation, deep Web mining, wrapper generation, screen scraping from
Rails with Ajax and possibly more, depending on my time and your feedback.
Since these entries are going to be longer, I will be posting them to
separate pages and announcing them on my blog.

The first article is ready; you can read it here.

It is an introduction to screen scraping/Web extraction with Ruby, with an
evaluation of the tools along with installation instructions and examples.

Feedback would be appreciated (leave your comment here/on the article page, or
send me a mail at peter@[name of this site].com). I will update/extend the
document and publish new ones based on your feedback.

Google Trends: Googlefight v2.0 and much more!

Posted on May 16, 2006 by peter

Every second blog I have come across recently has an entry about Google Trends, so I am adding my small findings too! 😉

After playing with it for a few hours, I have to say that writing a relevant query is not always as easy as it seems. People are posting Java vs Python vs Ruby comparisons, but they are not always aware that the graph contains (among other things) the comparison of an island, a comedy troupe (Monty Python) and a character set (Ruby Characters), for example. According to Wikipedia, all three terms have more than ten possible meanings, and although a tech nerd may know only one for each of them, not all pages out there are (fortunately) written by tech guys.

Let’s start with some Rails-related stuff:

Well, I wonder who else recently (not even necessarily in the computer industry) got so famous in a matter of days…
It is interesting that there is no data available for “David Heinemeier Hansson” or even “David Hansson”, just for DHH.

The next graph could answer the question whether it is a good idea for a web hosting company today to support Ruby on Rails:

Thanks to the Laszlo on Rails blog for the idea of the following googleTrendFight.

It’s really thrilling to see that a (once) small open source community can compete with enterprise stuff of such magnitude as JBoss/EJB (ok, this is kind of apples-to-oranges, but nevertheless interesting). If you wonder why JBoss’ search volume went up so dramatically – it’s because Red Hat bought the company.

Non-Rails related:
slashdot.com vs digg.com vs reddit.com:

No comment…

The last one is about Wikipedia, and it is kind of funny:

Why should this be funny? Because the only point in history (so far) when the search volume for Wikipedia declined was because of:

Probably (hopefully?!?!) there is no direct link between these facts, but it is an interesting random coincidence then…

I wonder whether Google will improve the quality of this search and/or add the possibility to specify advanced queries to prevent irrelevant results from being mixed in – at the moment, when I tried to narrow the search, in a lot of cases I got back ‘data not available’… Interesting toy, though.

Ruby, Rails, Web2.0

Experiences with Ruby and Rails, Web2.0 and other development technologies

Category Archives: Web2.0