Random Links from the Web, 04-04-2008

EuRuKo 2008 – Favorite Quotes

EuRuKo 2008

Quite a few blogs covered EuRuKo talks (for example here, here, here or here) so I am not going to do an n+1-th writeup. Instead I collected a few quotes I found interesting and/or funny. Without further ado, here we go:

Matz: Ruby – Past, Present and Future (Keynote)

  • Chad Fowler wrote a book ‘My Job Went to India’ – and yesterday my luggage went to Oslo!
  • Ruby was a hobby that came from not having a job after end of Japan’s bubble economy
  • I consider myself as no great programmer!
  • Then we have this nasty snake language here… (sampling through different languages starting from COBOL)
  • Python people love to be organized and have one true way. Ruby people don’t care.
  • (after going through a ton of different languages and finally arriving at Ruby) Ruby is not perfect either… but it’s close!
  • We will need to keep Ruby alive when Rails will be gone… Hopefully Ruby will be around in 15 years, but Rails… hmm…khm… never mind.
  • 10 years ago – Ruby? What’s That? Language? See This Cool Java!
  • 5 years ago – Ruby? I’ve Heard of It. But I haven’t Used It YET.
  • 2 years ago – Ruby! I Know! It’s for Rails, Right?

Koichi: Merging YARV is not a goal but a start

  • (Running around in a t-shirt ‘No Ruby No Life’) My version: No Ruby no Job!
  • 1st slide: [some completely different slide than his real topic title] – this is just a joke, the presentation is not on this
  • I don’t speak English so I wrote down everything on the slides… unfortunately most of the slides are in Japanese! (LOL)
  • Please ask questions in Japanese/C/Ruby! … or use slow/short English
  • I have a lot of slides (hm, like 50) of different optimization techniques but it’s just too much so I won’t show you here ;-)
  • My PhD thesis is in Japanese (Efficient implementation of Ruby VM), so please learn Japanese if you’d like to read it :)
  • NT means not Windows NT but Native Thread
  • Ruby thread and native thread: (shows a brutally complicated slide, fully in Japanese) and I would like to point out this part (clicks a button, and a red frame appears around a portion of fully Japanese text)
  • This is complex so I skip it (after the 10th incomprehensible slide)
  • I can’t program in Ruby!
  • Come to Japan/Enter my Lab (please teach me English) – Unfortunately I can’t employ you because there is not enough money, so please bring your own $
  • (response to a question about YARV’s memory need:) Yes, YARV also needs memory to start the VM! (considers the question answered. When asked for further details:) How much? hmm… you need to measure yourself ;-)

Charlie Nutter, Tom Enebo: JRuby – Ready for Action

  • Java != a dirty word!
  • Did you see this demo already? No? Good. Everybody has seen this 100 times in the US so they are really sick of it!
  • See? new is on the right side of the class name!
  • I love Swing (after listing 9342923 looooong functions of the Button class)
  • this would take 6-7 lines of Java (after adding a listener to a button in 1 line of Ruby syntax)
  • This is the weirdest error I have ever seen during a demo! I guess Mac OS X is not ready for this stuff yet (after not being able to kill off a process which slowed down his machine)
  • Kill that bird!!!! (after Thunderbird jumping around for like 5 minutes after OS X bootup)
  • ColdFusion is not a fine example for anything (Charlie, after a guy proposed ColdFusion as an example of… I don’t remember what, but it doesn’t matter)



Disclaimer: Please take the above with a grain of salt – based on these sentences (which were taken out of context and possibly squeezed/lost in translation) it might seem that Koichi’s goal was to persuade people to learn Japanese (not true, his talk was very interesting and deep) or that Matz was mocking Python (not true, he was just joking all the time) etc. If you have some more quotes, drop me a comment!

My EuRuKo 2008 Photos

EuRuKo 2008 is over… I have had a really great time, both as an organizer and an attendee, and can’t wait for next year’s conference!

Until that gets sorted out (currently Spain (Madrid) and Poland (Warsaw vs Krakow) are competing), here are some photos Marianna and I took… They were usually shot in a hurry and/or in the dark, so don’t expect too much (I guess I should invest in a better lens and flash :-) )

You can check out all the (correctly tagged) EuRuKo 2008 photos here.

Please post your photos to flickr or whatever service you are using, and leave a comment here with the address… Cheers!

The Top 10 Ruby/Rails Blogs

In my quest to whip my feed reader’s Ruby/Rails related content into shape a bit, I did a little research to find out which Ruby/Rails blogs are the most popular at the moment. I had given up on following most of the blogs systematically a long time ago – it is becoming increasingly hard to keep track of even the aggregators, not to mention the blogs themselves. There are hundreds of Ruby/Rails blogs out there right now (I am talking about the ones found on the few most popular aggregators – in reality there must be many more), so it is clear that you need to pick carefully – unless you happen to be a well-paid, full-time Ruby/Rails blog reader (in which case you would still have to crank a lot to do your work properly).

OK, enough nonsense for today – let’s see the results counting down from the 10th place! If you are interested in the method they were created with, or a longer, top 30 list from technorati and alexa, check out this blog entry.

10. http://weblog.jamisbuck.org/ by Jamis Buck.


Jamis Buck “is a software developer who has the good fortune to be both employed by 37signals and counted among those who maintain the Ruby on Rails web framework”. He is mostly blogging about (surprise, surprise!) Rails – of course on a very high level, which could be expected from a Rails core developer. Very insightful posts on ActiveRecord, Capistrano and other essential Rails topics delivered in a professional way.

9. http://weblog.rubyonrails.org by the Rails core team


This is the “default” Ruby on Rails blog, used for announcements, sightings, manuals and whatever else the RoR team finds interesting :-) .

8. http://www.slash7.com by Amy Hoy.


This is a really cool little site – Amy is a very gifted writer and designer, publishing very insightful articles as well as the nicest (hands down!) cheat sheets about different Web2.0, Ajax, Rails and that sort of stuff. Definitely worth checking out!

7. http://errtheblog.com by PJ Hyett and Chris Wanstrath.


A very serious blog of two Rails-geeks about advanced topics (but very well explained – so if you are not totally green (#00FF00) you should do fine). Among other things, they have contributed Sexy Migrations to Rails recently.
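
(If you have not seen Sexy Migrations yet, they essentially boil down to the shorter column syntax inside create_table – roughly like this; a from-memory sketch, not code taken from their blog:)

  class CreateUsers < ActiveRecord::Migration
    def self.up
      create_table :users do |t|
        # the 'sexy' shorthand: type first, then one or more column names
        t.string   :login, :email
        t.integer  :karma, :default => 0
        # created_at/updated_at columns in one go
        t.timestamps
      end
    end

    def self.down
      drop_table :users
    end
  end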

6. http://nubyonrails.com/ by Geoffrey Grosenbach


Geoffrey is the author of more than twenty Rails plugins (including gruff, my favorite graph drawing gem), a horde of professional-quality articles and the PeepCode screencast site. Do I need to say more?!
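
(As a small aside, gruff really is refreshingly simple to use – something along these lines, if memory serves; the data points are of course made up:)

  require 'rubygems'
  require 'gruff'

  g = Gruff::Line.new
  g.title = 'Visitors per day'
  g.labels = { 0 => 'Mon', 1 => 'Tue', 2 => 'Wed', 3 => 'Thu' }
  g.data('rubyrailways.com', [120, 240, 180, 310])
  g.write('visitors.png')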

5. http://redhanded.hobix.com/ by _why the lucky stiff.


_why is probably the most interesting guy in the Ruby community. He is the author of (among tons of other things) Why’s Poignant Guide to Ruby, Hpricot (the coolest Ruby HTML parser), Try Ruby! (a must see!) and Hackety Hack, for aspiring wannabe programmers who want to hack like in the movies! The list goes on and on… This guy never stops. If anyone ever invents the perpetuum mobile, it will be him (in Ruby, of course).

4. http://hivelogic.com/ by Dan Benjamin.


Dan’s recent work includes Cork’d, a Web 2.0 wine community site, and the A List Apart publishing system. He also does great podcasts with various guests.

3. http://mephistoblog.com/ by Rick Olson and Justin Palmer


Personally I was quite surprised that a blog concentrating on such a narrow topic (in this case the Mephisto blogging system) could grab 3rd place – so I checked both Alexa and Technorati by hand just to be sure, and it seems that everything is OK – mephistoblog is ranked very high on both of them, justifying this position. After all, Mephisto is the leading blog system of Rails!

2. http://www.rubyinside.com/ by Peter Cooper.


This blog is my absolute favorite from this top 10 list (actually, from all the Ruby blogs I have encountered so far). I am definitely with Amy Hoy, who said: “If you had to subscribe to just one Ruby blog, it should be this one.” If you would like to know what’s happening in the Ruby/Rails community, rubyinside is the place to check. If there is no new post here, it’s most probably because nothing happened!

And the winner is: http://www.loudthinking.com/ by David Heinemeier Hansson.


Well, what should I add? David is the author of Ruby on Rails, so no wonder his blog topped the list!



Conclusion

It’s interesting to note that nearly all the blogs listed here are mostly pure Rails ones – rubyinside (mixed Ruby/Rails) and redhanded (pure Ruby) being the two exceptions. It would be interesting to generate such a list for Ruby blogs – though I am not sure how: the sources I have used (most notably rubycorner) aggregate both Ruby and Rails blogs. So it seems there are many more Rails bloggers out there (or they are much better (with the exception of _why) than the Ruby bloggers).

I would really like to hear your opinion on this little experiment – whether you think it makes sense or is completely off, how it could be improved in the future, what features could be added etc. If I receive some positive feedback, I think I will work on the algorithm a bit more and run it, say, every 3 months to see what’s happening around the Ruby/Rails blogosphere. Let me know what you think!




Great Ruby on Rails REST resources

If I had to choose the single most not-really-well-understood, mystified, unsuccessfully demystified, explained and still not-really-grasped topic in the Rails world (and beyond), my vote would definitely go to REST. It seems to me that there are two types of people in the world: those who don’t get REST (and think it’s rocket science explained through quantum theory) and those who get it and don’t understand the former group (unless they are coming from there, that is).

I have been playing around with RESTful Rails recently. Below is my collection of Rails REST howtos, tutorials and other resources I have found so far, which were adequate for my transition from the first group to the second :-) .

You should definitely begin with REST 101 – then check out the other stuff as well!

Please leave a comment if you know some more (just for completeness’ sake – I think the above resources should be enough to grasp RESTful Rails both theoretically and practically).
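
As a tiny appetizer before you dive into the linked material, this is roughly what RESTful Rails boils down to in practice (a minimal sketch – the posts resource and the XML rendering are just placeholders I made up):

  # config/routes.rb - one line gives you the seven standard actions
  # (index, show, new, create, edit, update, destroy)
  ActionController::Routing::Routes.draw do |map|
    map.resources :posts
  end

  # app/controllers/posts_controller.rb
  class PostsController < ApplicationController
    # GET /posts and GET /posts.xml both end up here
    def index
      @posts = Post.find(:all)
      respond_to do |format|
        format.html                                  # renders index.rhtml
        format.xml { render :xml => @posts.to_xml }
      end
    end
  end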


Book Review: Build Your Own Ruby on Rails Web Applications

Build Your Own Ruby on Rails Web Applications
Author: Patrick Lenz
Publisher: SitePoint
Pages: 447
Intended Audience: Beginners/Pre-intermediate
Rating: 5/5

I would like to begin with a few words about SitePoint. According to their own description, ‘SitePoint specializes in publishing fun, practical, and easy-to-understand content for web professionals.’ So far I have had the pleasure of reading three of their books: (obviously) Build Your Own Ruby on Rails Web Applications, The CSS Anthology: 101 Essential Tips, Tricks & Hacks and The Principles of Beautiful Web Design. If I had to judge the publisher based on these three books, I could not agree more: I have found all their claims (fun, practical and easy-to-understand) to be unquestionably true.

After a brief overview of the book I would like to concentrate on the question that has probably popped up in most readers’ minds: why should I prefer this book over Agile Web Development with Rails or other Rails books available? We’ll look into that in a minute, but first things first: let’s see what Build Your Own Ruby on Rails Web Applications has to offer!

The book starts off with installing Ruby, RubyGems, Rails and even MySQL on different operating systems, presented in painstaking detail – which is very good in my opinion, since advanced users will skip this section anyway, and it offers a great step-by-step walkthrough for novices.

The second chapter is the compulsory ‘introduction to Ruby’. I have to admit I did not read it – but judging from the contents and a quick skim-through, it offers at least the same knowledge as other similar Rails books, which is more than enough to get you started. If you would like to go deeper into both Ruby and Rails, I suggest checking out David A. Black’s excellent Ruby for Rails.

Chapter 4, ‘Rails Revealed’ is the only more-or-less theoretical chapter, discussing the architecture, components and conventions used in Ruby on Rails.

The real action starts in Chapter 5, in the form of building a digg-clone from scratch. You will learn how to build a Rails application, beginning with generating the necessary files and ending up with a nicely working, (relatively) feature-rich digg-like site: dealing with user management (even showing a user view with submitted stories), allowing you to submit and vote on stories (just as you would expect from an application like this), sprinkled with a lot of tasty tidbits like tagging (also introducing polymorphic associations in a very easy-to-understand way) or (of course) AJAX.
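
(Just to illustrate what the polymorphic associations bit is about – this is the generic Rails idiom, not code lifted from the book:)

  class Tag < ActiveRecord::Base
    # a tag can belong to a story, a comment or anything else 'taggable'
    belongs_to :taggable, :polymorphic => true
  end

  class Story < ActiveRecord::Base
    has_many :tags, :as => :taggable
  end

  class Comment < ActiveRecord::Base
    has_many :tags, :as => :taggable
  end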

The book finishes with some advanced topics: Debugging, Testing and Benchmarking, followed by Deploying and Production use, providing instructions to deploy your application on Apache with Mongrel.

If the review ended right now, you could (rightfully) ask: ‘So what? These are exactly the things I would expect from a Rails book’ – and you would be perfectly right. So let’s see why this book is different from all the other ones available on the market!

First of all, it is written in a very understandable and easy-to-digest way: it explains everything as simply as possible, making even the more complicated topics clear right away. I don’t remember having to read anything twice, no matter how advanced the topic was. I think this alone makes Build Your Own Ruby on Rails Web Applications one of the best hands-on RoR books today (definitely the best one I have seen so far, but since I have not read all the competitors, I can not unambiguously claim it is the best one).

What I also like about this book is that it requires almost no prerequisites at all – the bare minimum that is needed is explained along the way during the application’s creation, or can be learned from the book itself.

A big difference compared to Agile Web Development with Rails – which is the de facto Rails book today – is that testing of the created components is described in great detail. The usual workflow is thus problem statement, solution and creating unit tests to verify the code – explaining the why’s and how’s as well. I am not aware of any RoR book currently available that would explain and demonstrate testing your code to this extent.

One could argue that Build Your Own Ruby on Rails Web Applications is not deep enough, which is more or less true (compared to e.g. Agile Web Development with Rails) – but I think this is perfectly fine, since going very deep is not the purpose of the book at all! If you need in-depth coverage of Rails internals, or would like to go into advanced topics like caching, scaling or deployment in great detail, then this is not the book to get. However, if you would like to try Ruby on Rails right away, without the need to google for blogs helping you to install the prerequisites or get this and that right, be sure to check it out!

Ruby’s Growth Comes to an End?!

According to O’Reilly’s latest report on the state of the computer book market, focusing on programming books, Ruby has a definite lead. Check out this treemap view – I believe it does not need too much additional explanation (the percentages reflect relative book sales compared to Q1 2006):



Now, I would not like to start a language war here at all – there is no need for the Ruby camp to draw zealous conclusions, nor for proponents of other languages to come up with explanations. The diagram shows that compared to the same period of 2006, demand is currently biggest for Ruby (and other Ruby-based/related) books – and nothing more. It does not tell anything about the number of people using the given language or related frameworks, job opportunities or the absolute market share – this is just a relative indicator based on the programming book market.

However, if you take a peek at the TIOBE index for May – entitled ‘Ruby’s growth comes to an end’ – you can see that Ruby is the fastest-growing language at the moment (again, compared to the same period of 2006). If this is the ‘end of the growth’, then what does growth look like?!

It is also interesting to check out this graph from TIOBE:



It tells me that, starting from July 2006, no other programming language has shown as big (and steady) a growth as Ruby.

I don’t know on what basis the TIOBE guys came to the conclusion that Ruby is losing steam… I have talked to a few Ruby on Rails freelancers recently, and each of them confirmed independently that there is a bigger need for Ruby/Rails programmers than ever. Based on these (and other) data points I would say quite the opposite is true: my personal feeling is that Ruby/Rails is just going to be a *lot* bigger than it is currently!

Partitioning Sets in Ruby

While hacking on various tasks, I have needed to partition a set of elements quite a few times. I have attacked the problem with different homegrown implementations, mostly involving select-ing every element belonging to the same basket in turn. Fortunately I ran across Set#divide recently, which does exactly this… No more wheel reinvention! Let’s see a concrete example.

I have an input file like this:

a 53 2 3
b 8 62 1 23
a 9 0 31
b 4 45 4 16 7
b 1 23
c 3 42 2 31 4 6
a 1 3 22
a 7 83 1 23 3
b 1 14 4 15 16 2
c 5 16 2 34

The goal is to sum up all the numbers in rows beginning with the same character (e.g. to sum up all the numbers that are in a row beginning with ‘a’). The result should look like:

[{"a"=>241}, {"b"=>246}, {"c"=>145}]

This is an ideal task for divide! Let’s see one possible solution for the problem:

require 'set'

input = Set.new(open('input.txt').readlines.map { |line| line.chomp })
groups = input.divide { |x, y| x[0, 1] == y[0, 1] }
# build the array of hashes
p groups.inject([]) { |result, group|
  # build a hash for the lines beginning with the same letter
  result << group.inject(Hash.new(0)) { |sums, line|
    # for every line, sum the numbers it contains
    sums[line[0, 1]] += line[2..-1].split(' ').inject(0) { |sum, n| sum + n.to_i }
    sums
  }
}

The output is:

[{"a"=>241}, {"b"=>246}, {"c"=>145}]

Great - it works! Now let's take a look at the code...

The 3rd line loads the lines into a set like this:

#<Set: {"a 53 2 3", "b 8 62 1 23", "a 9 0 31", "b 4 45 4 16 7", "b 1 23", "c 3 42 2 31 4 6", "a 1 3 22", "a 7 83 1 23 3", "b 1 14 4 15 16 2", "c 5 16 2 34"}>

The real thing happens on line 4. After its execution, groups looks like:

#<Set: {#<Set: {"a 53 2 3", "a 9 0 31", "a 1 3 22", "a 7 83 1 23 3"}>, #<Set: {"b 8 62 1 23", "b 4 45 4 16 7", "b 1 23", "b 1 14 4 15 16 2"}>, #<Set: {"c 3 42 2 31 4 6", "c 5 16 2 34"}>}>

As you can see, the set is correctly partitioned now - with almost no effort! We did not even need to require an external library...

The rest of the code is out of the scope of this article (everybody is always complaining about the long articles here, so I am trying to keep them short) - and anyway, the remaining snippet is just a bunch of calls to inject. If inject does not feel too natural to you, don't worry - it took me months until I got used to it, and some people (despite the fact that they fully understand it and are able to use it) never reach for it - I guess it's a matter of taste...
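
(For the record, the canonical inject example is summing an array - once this clicks, the snippet above is just more of the same:)

  # the accumulator starts at 0 and collects each element in turn
  [1, 2, 3, 4].inject(0) { |sum, n| sum + n }   # => 10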

Needle in the Haystack – Information Overloading 2.0

Do you also have the feeling that you are totally drowning under the unbelievable amount of information emitted by the Web today? (And by other media as well, which makes it even worse, but I would like to focus solely on the Web aspect in this article.) I feel more and more frustrated day by day, trying to stay on top of my ever-growing heap of unopened e-mails, unread blog entries, unchecked news sites etc., with a constant fear that though I spend a fair amount of time consuming and processing all the information pouring in, I am still missing something very important all the time.

The “problem” is that there are way too many outstanding blogs, aggregators, social news sites, bookmarking service popular links and other sources of information which you “just can not miss”. I fear I am definitely losing the battle – there are more and more information sources, but no new, more effective methods (at least I don’t know about them) to handle them, so I guess it’s pretty clear that as time progresses, more and more info will fall through the cracks (or more and more time will be needed to prevent this).

Since there is no way to stop the exponential growth of information (and if there were, I doubt anybody would want to use it – this is just not the way this problem should be approached), we have to influence the other factor: find more effective means of locating, sorting, bookmarking, processing and otherwise handling the most important data.

It is interesting to observe that at the moment, services with this intention are not really receiving as much attention as they should – provided that the above reasoning is sound and thus there is a need for more effective handling of existing information. Google is a trivial example of this: it has loads of interesting tricks to refine, specify and narrow your search (like the synonym operator, ~, or other advanced options) – yet I bet 2 hours of my most precious blog-reading time that most of us can not even tell when we last used advanced search (besides a few trivial operators entered into the search box, like site:.rubyrailways.com). In most cases I never check out more than 2-3 result pages (and just the first page 90% of the time) – which is interesting, given that I am doing 99% of my searches on google!

In my opinion, exactly the opposite is true: sites like Twitter or tumblelogs are immensely popular, flooding you with ever more information, all the time, every minute, as fast as possible. You did not have enough time to read blogs? No problem, here are tumblelogs and Twitter messages, which will ‘help’ you by shooting even more data right into your face, much more frequently than ever. Welcome to information overloading 2.0.

Fortunately there is hope on the horizon: some sites are striving to help the situation by providing interesting views on the data, narrowing down the information to a specific niche, or aggregating and presenting it in a way so that you do not have to hand-pick it from an enormous everything-in-one-bag infosoup. I will try to describe a few of them which I have found interesting recently.

  • Tools utilizing visual representation of data – People are visual beings. In most of the cases, a few good, to-the-point pictures or diagrams can tell much more than boring pages of text. Therefore it is quite intuitive that visual representation of data (typically result of search engine queries) could help to navigate, refine and finally locate relevant results compared to text-only pages.

    My current favorite in this category is quintura. Besides working as a normal Yahoo search, quintura does a lot of other interesting things: it finds tags related to your query and displays them as a tag cloud. You can further refine the search results or navigate to documents found by any of the related tags. Hovering over a related tag displays the related tags for that tag. For example, when searching for web scraping and hovering over the ‘ruby’ related tag, ‘scrubyt’ is also displayed – it would definitely take more time to find scrubyt on google, even by using the search term combination ‘web scraping ruby’ – so the functionality offers more than just a fancy view, it actually makes searching faster and more effective.


    quintura in action

    Am I using quintura regularly? Nope. Given that I stated just a few sentences ago that it ‘actually makes searching faster and more effective’, this is strange – but for some reason, if I am after something, I am trying to find it on google.com. This is rather irrational, don’t you think so?

  • Sites concentrating on a specific niche – I feel that (otherwise great) sites like digg are just too overcrowded for me: with over 5000 submissions a day in a lot of diverse categories, it simply takes too much time to read even just the front page stories. I am mainly interested in technology and development related articles – and while a lot of digg proponents argue that there are both technology and programming categories on digg, they are still too ‘mainstream’ for my taste and rarely cater to a hardcore developer/hacker in my opinion.

    Fortunately dzone and tektag are here to rescue the situation!

    The guys over at dzone are really cranking all the time to bring a great digg-like site for developers that helps you stay on top of the current development and technology trends. The community (which is crucial in the case of such a site, of course) is really nice and helpful, and in my opinion the site owners have found (and are constantly fine-tuning) the right magic formula to keep the site from being overloaded with redundant information while still delivering the most relevant news and stuff. Currently, dzone is my no. 1 source of developer and tech news on the web.

    In my opinion, tektag has not reached the maturity level of dzone yet (I think they are currently in public beta), but once it does, I bet it will be a very important and relevant source of information for developers, too. To put it simply, tektag is to del.icio.us what dzone is to digg. Why is this so great? If you need to bookmark something, you should just use del.icio.us, right? Wrong – at least if you intend to use del.icio.us in any way other than storing your personal bookmarks. The problem with del.icio.us, again, is that people use it to bookmark just about anything – therefore it is virtually impossible to track the movers and shakers in a narrow topic (like programming). Visiting del.icio.us/popular will show you what’s being bookmarked the most overall, not inside your category of interest (of course I know there are categories like del.icio.us/popular/programming, but these still do not fully solve the problem by far).

    Tektag has the potential to solve this situation by adding development-specific features and tweaks, but most importantly by the fact that only developer articles will be saved there, and thus interpreting the data will be much easier, since the input won’t be cluttered with an enormous amount of information from arbitrary topics. In my opinion the only question regarding their success is: can they build the critical user mass?

  • Semantic search – if you hear the words ‘search engine’, most probably google or one of its competitors (yahoo, msn) springs to your mind, and you are right – for the absolute majority of searches, we are using these sites. However, they are not really that powerful in letting you express what exactly you are searching for (and, stemming from this fact, in actually bringing you the right results), because they are not trying to understand the documents on the Web: they just crawl and index them to be searchable with the phrases they contain.

    Since the above mentioned sites are still the absolute market leaders in search, it’s clear that keyword-based indexing is still good enough(tm) – until somebody shows that there is a more sophisticated way of searching, by applying natural language processing, ontology extraction and other semantic techniques to actually understand the documents, and delivering usable results with these techniques.

    Spock, an upcoming people search engine, is a great example of this principle in action. Spock’s goal is to crawl the whole web and extract information about people – which is far from trivial – since to do this properly, their spiders have to be smart enough to understand human language as much as possible (a simple example: think of a birth date, e.g. 07/06/05 – is 07 denoting a day (meaning the 7th day of the month) or a year (the year 2007)? There are hundreds, maybe thousands of date formats used on the Web – and there are far more complicated problems to solve than this).

    OK, complex problems or not, what’s so cool about a people search engine? After all, you can use ye good olde google, as for everything else. Tim O’Reilly has an excellent example against this approach: on google, it’s trivial to find Eric Schmidt, google’s CEO – however, it’s much harder to find the other 44 Eric Schmidts returned by spock. It’s not that google does not find them – but actually locating them in approximately 4,500,000 returned documents (as opposed to spock’s 45) is nearly impossible.
    Spock is probably the best example in this article to demonstrate how a service should bring you all the information you need – and not even a bit more!

If these services are so promising and they help you fight information overload, thus helping you find the desired information more easily (so that you will have more time to read other blogs :-) ), why are they orders of magnitude less popular than the ones flooding you all the time? Why don’t people use things as simple as advanced google search to fight information overload? Is information overload a bad thing at all (since it seems the sites generating the most information at the fastest pace are the most popular)? I can’t really answer these questions at the moment, but even if I could, I have to run now to read some interesting (tumble|b)logs. What!? 20 twitter messages received? Ok, seriously gotta go now…

Getting Beast up and Running on Dreamhost (for the Truly Lazy)

Though dreamhost offers phpBB as one of their one-click install goodies (ergo it is the easiest forum of all to install, since you almost don’t have to do anything), I have been looking for something different. To me, phpBB’s interface was always quite unintuitive and too heavy – I wanted something smaller, easier, more compact. The problem was I did not know what I should search for – until I came across beast, a lightweight forum written in Ruby on Rails. It was love at first sight!

When it comes to the tools I am using, I am really language agnostic – this very blog uses WordPress (PHP), I am using Trac (Python) to track my projects, MediaWiki (PHP) is my preferred wiki etc. – so even if it may seem so, I did not choose beast because it is written in Rails (although +1 for that :-) ), but because of the design and ease of use. My first thought after trying it was ‘wow, this is as easy to use as a 37signals app’ – it’s really that intuitive and well designed!

Well, this sounds fine and all, but installation on dreamhost was a different story. Thank God I found a superb, step-by-step HOWTO here. However, even after following all the steps, I got ‘incomplete headers’ and other problems, which I have managed to fix – here are some additional comments to the HOWTO:

6. You can forget about this point; as the HOWTO says, it is already installed on DH and it will work without any problems.

7. Forget about ‘development’ and ‘test’, however be sure to get ‘production’ right, as the next step will not work otherwise. It should look something like this:

production:
  adapter: mysql
  database: beast_prod
  host: mysql.myhost.com
  username: us3r
  password: p4ss
  port: 3306

8. For me it worked only *with* the RAILS_ENV=production parameter specified.

9. You can change the salt to anything – it just must not stay the same. The easiest thing is to add or remove a random character from the string.

12. The shebang should be updated to #!/usr/bin/ruby

13. The || should be removed, i.e. it should read:

ENV['RAILS_ENV'] = 'production'

14. Make sure you change the permission of those directories only – I have changed everything recursively, destroying the executable flag of dispatch.fcgi :-) .

Now you should apply the ‘GetText patch’ – it can be found later in the thread. After that, you should be up and running!

After playing around, I found that the user listing was not working – fortunately I found the fix for this in the forum as well. The solution is:

app/views/users/index.rhtml line 3 should be modified to

<% form_tag '', :method => 'get' do -%>

Enjoy this great forum!

Problems of Social Bookmarking Today – Part One

Can you imagine the on-line world without del.icio.us, reddit, digg, dzone and other Web2.0 social bookmarking sites? Sure, you can – they were not always around and nobody missed them before they appeared. However, since their debut, I guess no serious geek can exist without them anymore. The functionality and information richness these sites offer are unquestionable – however, more and more flaws and problems are popping up as people learn to use, monetize, abuse, trick and tweak them. I would like to present my current compilation of woes and worries, sprinkled with a few suggestions on how to handle them.

DISCLAIMER: this is my subjective view on these matters – I am not claiming the things presented here are objectively true – this is just my personal perception.

General Problems

I read a nice quote recently – unfortunately I can not find it right now. It goes something like this: “Time is nature’s way of preventing everything from happening all at once. It does not seem to work lately…”

Though the notion of a social bookmarking site did not even exist when this quote was thought up, it captures the essential problem of these sites very well: too many things are happening all at once, and it is therefore impossible to process the amount of information pouring in from everywhere…

  • Information overload – I think this fact is not really a jaw-dropping, mind-boggling discovery – but since it is the root of all evil (not just in the context of Web2.0 or social sites, but in general for the whole web today) it deserves to be presented as the first problem in this list. Today it is almost certain that the thing you are looking for is on the Web (whether legally or illegally) – it is a much bigger problem to actually find it! This applies to the social sites as well. A site like digg gets about 5000 article submissions every day – and even if you restrict yourself to the front page stories, it is virtually impossible to keep up with them unless you spend a given (not so short) amount of time every day just browsing the site. O.K., this is not a Web2.0 or social site problem per se, but it is quite a hard one to solve nevertheless.


    Proposed solution: I don’t have the foggiest idea :-) Basically an amalgam of the solutions presented in the next points…

  • Articles get pushed down quickly – which is inevitable and not even a terrible problem in itself, since this is how it should work – the worse thing is that the good stuff sinks just as fast as the crap – i.e. every new article hitting the front page makes all the others sink by 1 place.


    Proposed solution: The articles could be weighted (+ points for more votes, more reads, more comments etc., - points for thumbs down, spam reports, complaints etc.) and the articles should sink relative to each other at any given moment – i.e. the weight should be recalculated dynamically all the time and the hottest article should be the most sticky, while the least-voted-for should exchange its place with the upcoming, more interesting ones (see the sketch after this list).

  • Good place, wrong time – if you submitted a very interesting article and the right guys did not see it at the right time, it will inevitably sink and never make it to the front page. It is possible that if you had submitted it half a day later, it would have been noticed by the critical mass needed to make it to the front page – the worst thing is that you never even know whether this is so.


    Proposed solution: Place a digg/dzone/del.icio.us/whatever button after or before the article – this way, people will have the possibility to vote on your article after reading it, no matter how they got to your site and when. The article will stay on your site forever – whereas on digg it will be present in a relevant place for just a few hours.

  • Url structure problems – sometimes the same document is represented by various URLs, which confuses most of the systems. The most frequent manifestations of this problem are: URL with and without www (like http://www.rubyrailways.com and http://rubyrailways.com), change of the URL style (from /?p=4 to /2002/4/5/stuff.html) or redirects, among other things.


    Proposed solution: Decide on a URL scheme and use it forever (generally, /?p=4 is not a recommended style – /2002/4/5/post.html and other semantically meaningful URLs are preferred (see Cool URIs never change); set your web server to turn http://www… into http:// (or the other way around)). The sites could also remedy the situation by not just checking the URL, but also the content of the document (like digg does just before submission).
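
To make the weighting idea from the ‘Articles get pushed down quickly’ point a bit more concrete, here is a back-of-the-envelope sketch – the formula and all the numbers are completely made up, it is only meant to illustrate the ‘recalculate the weight dynamically’ part:

  # Hypothetical scoring: fresh votes/comments push an article up,
  # thumbs-downs, spam reports and old age pull it down.
  def article_weight(article)
    positives = article[:votes] * 3 + article[:comments] * 2 + article[:reads] * 0.1
    negatives = article[:thumbs_down] * 5 + article[:spam_reports] * 20
    age_in_hours = (Time.now - article[:submitted_at]) / 3600
    (positives - negatives) / (age_in_hours + 2) ** 1.5
  end

  # the front page is then simply the list re-sorted by the current weight
  def front_page(articles)
    articles.sort_by { |article| -article_weight(article) }
  end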

Tagging

Tagging is a great way of describing the meaning of an item (in our case a document) in a concise and easy to understand way – from a good set of tags you should know immediately what the article is about, just by reading them. The idea is not really brand new – scientific papers have been using this technique for ages (much like PageRank – long before PageRank was implemented by the google guys, it was an accepted and commonly used technique to rank scientific papers based on the number of citations in other relevant works).

Some sites have a predefined, finite set of tags (like dzone) while some allow custom ones (like del.icio.us – usually with suggestions based on the tags of others or by extracting keywords from the article). The problem with a predefined tag set is that you are restricted to using only the tags offered by the site – well, this is sometimes good because it gives you some guidelines about what is accepted on the site. There are much more interesting problems with sites that allow custom tags:

  • No commonly accepted, uniform tagging conventions – some of these sites accept space-separated tags, some quoted ones, and some of them do not require or recommend any specific format. This is again a source of confusion, even inside the same system. Consider these examples:

    ruby-on-rails
    ruby on rails
    ruby_on_rails
    "ruby on rails"
    RubyOnRails
    ruby rails
    ruby,rails
    ruby+rails
    RUBY-RAILS
    ror
    ROR
    rails
    programming:rails
    

    and I could come up with tons of other ones. The problem is that all these tags are trying to convey the same information – namely that the article is about ruby on rails. Of course this is absolutely clear to any human being – however, much less so for a machine.


    Proposed solution: It would be beneficial to agree on one accepted tagging convention (even if you can not really force people to use it). The sites could use (even more) heuristics to turn tags with the same meaning into one – see the sketch after this list. For example, if a user has a lot of ruby and rails bookmarks and tags something with ‘rails’, it is very likely that the meaning of the tag is ‘ruby on rails’ etc.

  • Too many tags and no relations between them – I think everybody has, or at least has seen, a large del.icio.us bookmark farm. The problem with the tags at this point is that there are a lot of them, and they are presented in a flat structure, without any relations between them. (O.K., there is the tag cloud, but it is more of an eye candy in this sense.) With a really large number of tags (say hundreds of them) the whole thing can become really cumbersome.


    Proposed solution: Visualization could help a lot here. Check out this image:





    Example of a Clustered Tag Graph





    I think such a representation would make the whole thing easier, especially if it were interactive (i.e. if you clicked the tag ‘ActiveRecord’, the graph would change to show the tags related to ‘ActiveRecord’). The idea is that all of your tags should be clustered (where related ones belong to one cluster – the above image is an example of a toread-ruby cluster) and the big graph should consist of the clusters, with each cluster’s main element highlighted for easy navigation. If you clicked a cluster, it would zoom in, etc.

  • Granularity of tagging – this is a minor issue compared to the others, but I would like to see it nevertheless: it should be possible to mark and tag paragraphs or other smaller portions of the document, not just the whole document itself. Imagine a long tutorial primarily about Ruby metaprogramming. Say there is an exceptionally good paragraph on unit testing, which makes up about 0.1% of the whole text. It might be wrong to tag the document with ‘unit testing’, since it is not about unit testing – however, I would still like to be able to capture the outstanding paragraph.


    Proposed solution: Again, visual representation could help very much here. I would present a thumbnail of the page, big enough to make distinguishing objects (paragraphs, images, tables) possible, but small enough not to be clumsy. Then the user would have the possibility to visually mark the relevant paragraph (with a pen tool), and tag just that.
    This should result in a bookmark tagged like this:





    Example of More Granular Tagging





    On lookup, you will see the relevant lines marked and will be able to orient yourself faster.
    To some people this may look like overkill – however, nobody forces you to use it! If you would like to stick with the good old tag-one-document method, it’s up to you – however, if you choose to also tag up some documents like this, you have the possibility.

  • Tagging a lot of things with the same tag is the same as tagging with none – consider that you have 500 items tagged with ‘Ruby’. True, you still don’t have to search the whole Web, which is much bigger than 500 documents, but still, it is a real PITA to find something in 500 documents.


    Proposed solution: the clustered tag graph could help to navigate – usually you are not looking for just ‘Ruby’ things but for ‘Ruby and testing and web scraping’, for example. Advanced search (coming in vol. 2), where you can specify which tags should be looked up and also what the document should contain, could remedy the problem, too.

  • Common ontologies, synonyms, typo corrections – O.K., these might seem to be rocket science compared to the other, simpler missing features – however, I think their correct implementation would mean a great leap for the usability of these systems. Take for example web scraping, my present area of interest. People are tagging documents dealing with web scraping with the following tags: web scraping, screen scraping, web mining, web extraction, data extraction, web data extraction, html extraction, html mining, html scraping, scraping, scrape, extract, html data mining – just off the top of my head. I did not think about them really hard – in fact there are many more.
    It would resolve much confusion if all these terms were represented by a common expression – say ‘web scraping’.


    Proposed solution: this is a really hard nut to crack, stemming from the fact that e.g. screen scraping can mean something different to different people. However, a heuristic could look up all the articles which are tagged with e.g. web scraping – and find the synonyms by going through all these articles. It is not really hard to find out that ‘web scraping’ and ‘ruby’ or ‘subversion’ are not synonyms – however, after scanning enough documents, the link between ‘web scraping’ and ‘html scraping’ or ‘web data mining’ should be found by the system. The synonyms could also be exploited by using the clustered tag graph.
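
As promised above, here is a rough sketch of the kind of tag-unification heuristic I have in mind for the ‘ruby on rails’ tag zoo – the alias table and the rules are of course just made-up examples, a real site would need a much bigger (and smarter) version:

  ALIASES = {
    'ror'         => 'ruby-on-rails',
    'rails'       => 'ruby-on-rails',
    'ruby-rails'  => 'ruby-on-rails',
    'rubyonrails' => 'ruby-on-rails'
  }

  def normalize_tag(raw)
    tag = raw.downcase.strip
    tag = tag.gsub(/[\s_+,]+/, '-')      # 'ruby on rails', 'ruby_on_rails' -> 'ruby-on-rails'
    tag = tag.gsub(/^"|"$/, '')          # strip surrounding quotes
    tag = tag.sub(/^programming:/, '')   # drop namespace-style prefixes
    ALIASES[tag] || tag                  # map known aliases to the canonical tag
  end

  normalize_tag('RubyOnRails')       # => "ruby-on-rails"
  normalize_tag('"ruby on rails"')   # => "ruby-on-rails"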

Voting

The idea of voting for articles as a means to get them on the front page (as opposed to editor-monitored, closed systems) seemed revolutionary and definitely the right, people-centered way to rank articles from the beginning – after all, it is really simple: people vote on stuff they like and find interesting, which means that the most interesting articles get to the front page. Or do they? Let’s examine this a bit…

  • Back to the good old web 1.0 – when Tim O’Reilly coined the term Web2.0 in 2005, he presented a few examples of typical web1.0 vs web2.0 solutions, for example: Britannica Online vs Wikipedia, mp3.com vs napster etc. I wonder why he did not come up with slashdot (content filtered by editors) vs digg (content voted up by people). At that time everybody was so euphoric about Web2.0 that no one would question this claim (neither did I at the time).

    However, it seems to me that after these sites evolved a bit, there is basically not that much difference between the two: according to this article, Top 100 Digg Users Control 56% of Digg’s HomePage Content. So instead of 10-or-something-like-that professionals, 100-or-something-like-that amateurs decide about the content of digg. So where is that enormous difference after all? Wisdom of crowds? Maybe the wisdom of a few hundred people. Because of the algorithms used, if you don’t have much time to submit or digg or comment or look for articles all the time (read: a few hours a day) like these top diggers do, your vote won’t count too much anyway. Digg (and I read that also reddit, and possibly sooner or later this fate awaits more sites (?)) became a place where “Everyone is equal, but some are more equal than others…”.


    Proposed solution: None. I guess I will be attacked by a horde of web2.0-IloveDigg fanatics claiming that this is absolutely untrue, and since I have no real proof of this point (and don’t have the time/tools to produce one) I am not going to argue here.

  • Too easy or too hard to get to the front page – The consequence of some of the above points (Information overload, Good place, wrong time, Back to the good old web 1.0) is that if the limit to get to the front page is too high, it is virtually impossible to reach it (unless you are part of a digg cartel or you have a page which has a lot of traffic anyway + a digg button). However, if the count is too low (hence it is too easy to get to the front page), people might be tempted to trick the system (by creating more accounts and voting on themselves, for example), just to get to the front page – which will result in a lot of low quality sites making it there. Though I don’t own a social bookmarking site, I bet that finding the right height of the bar is extremely hard – and it even has to change from time to time in response to more and more submissions, SEO tricks etc.


    Proposed solution: A well-balanced mixture of silicon and carbon. Machines can do most of the job by analysing logs, activities of the user on the page, thumbs up/down received from the user, articles submitted/voted/commented and other types of usage mining. However, machines alone are definitely not enough (since they don’t have the foggiest idea about what’s in an article) – a lot of input is needed from humans, too: on the one side from the users (voting, burying, peer review etc.) and from the editors as well. However, I think this is all done already – and the result is not really unquestionably perfect, I guess mainly because of the information overload – 5000 submissions a day (or 150,000 a month) is very hard to deal with…

  • Votes of experts should count more – In my opinion, it is not right that if a 12 year old script kiddie votes down an article and an expert with 20 years of experience votes it up, their votes are taken into account with equal weight. OK, I know there is peer review, and if the 12 year old makes a lot of stupid moves, he will be modded down – so he will open a new account and begin the whole thing again from scratch. On the other hand, the expert maybe does not have time to hang around on digg and similar sites (because he is hacking up the next big thing instead of browsing) and therefore he might not get a lot of recognition from his peers on the given social site – which does show that he is an infrequent digg/dzone/whatever user, but tells nothing about his tech abilities.


    Proposed solution: I think it is too late for this with the existing sites, but I would like to see a community with real tech people: developers, entrepreneurs and hackers of all sorts. How could this be done? Well, people should show what they have done so far – their blog, released open source software, mailing list contributions, sites they designed or any other proof that they are also doing something and not just criticizing others (it seems to me that the most abrasive people online are always those who do not have a blog, did not hack up something relevant or did not prove their abilities in any relevant way). This would also ensure that only one account belongs to one physical person. I know that this may sound like too much work (both on the site maintainer’s and the users’ side) but it could lay the foundation for a real tech-focused (or xyz-focused) social site. Of course this would not lock out people without any tangible proof of their skills – however, their votes would count less.

  • Everything can be hot only once – Most of the articles posted to the social bookmarking sites are ‘seasonal’ (i.e. they are interesting just for a given time period, or in conjunction with something hot at the moment) or news (like announcements, which are interesting for just a few days). On the other hand, there are also articles which are relevant for much longer – maybe months, years or even decades. However, because of the nature of these sites, they are out of luck – they can have their few days of fame only once.

    One could argue that this is how it should be – however, I am not sure about it. Take for example my popular article on Screen scraping in Ruby/Rails: I am getting a few thousand visitors from google and Wikipedia every month (which proves that the article is still quite relevant) and close to zero from all the social sites, despite the fact that it was quite hot upon its arrival. Moreover, I have updated it since its first appearance with up-to-date information, so it is not even the same article anymore, but a newer, more relevant one.


    Proposed solution: Let me demonstrate this on a del.icio.us example, where a certain amount of recent bookmarks is needed to get to the ‘popular’ section (something similar to the notion of the front page on digg-style sites). In my opinion, this count should also depend on the number of already received bookmarks (see the sketch after this list). Let’s see an example: Suppose a brand new article needs 50 recent bookmarks to get to del.icio.us/popular. After getting there and creating a great stir, it gets bookmarked 300 times. Then, for the next 50 days it does not receive that much attention, gets 1 bookmark a day on average, so it has 350 bookmarks altogether. However, after these 50 days, for some reason (e.g. some related topic gets hot) 30 people bookmark it in a few hours. In my opinion, it should get popular again – and moreover, with these 30 (and not 50) bookmarks – because it was already popular once. This metric should then be adjusted after getting popular once again – if this happens and people don’t really bookmark it anymore despite it being featured on /popular, it should again require 50 (or more) bookmarks.
    On digg-style pages I would create a ‘sticky’ section for articles that are informative and interesting for a longer timespan. I would add another counter to the article (‘stickiness’) which should be voted up by both editors and users in a similar way as ‘hotness’ is now. Of course, what should be sticky is very subjective – it is easy to know that news is not sticky, but harder to decide this in the case of other material.
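
Here is the sketch mentioned above – a hypothetical version of the ‘dynamic popularity threshold’: the base of 50 recent bookmarks and the logarithmic discount are completely made-up numbers, the point is only that items which were already popular once should need fewer recent bookmarks the next time:

  def recent_bookmarks_needed(total_bookmarks, base = 50)
    # the more bookmarks an item has collected over its lifetime,
    # the fewer *recent* bookmarks it needs to hit /popular again
    discount = (Math.log10(total_bookmarks + 1) * 10).round
    [base - discount, 10].max     # but never drop below 10 recent bookmarks
  end

  recent_bookmarks_needed(0)      # => 50 for a brand new article
  recent_bookmarks_needed(350)    # => 25 for an already-popular one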

Since I have never had the chance to try these ideas in practice, I can’t tell how many of them (and to what extent) would work in real life. I guess there is no better method to find out than to actually implement these features… and the other ones coming in vol. 2!

In the next part I would like to take a look at the remaining problems, connected with searching and navigation, comments and discussion, the human factor and miscellaneous issues which did not fit into other categories. Suggestions are warmly welcome, so if there are some interesting ideas, I will try to incorporate them into the next (or this) installment!



Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1

This article is a follow-up to the quite popular first part on web scraping – well, sort of. The relation is closer to that between Star Wars I and IV – i.e., in chronological order, the 4th comes first. To continue the analogy, I am probably in the same shoes as George Lucas was after creating the original trilogy: the series became immensely popular and there was demand for more – in both quantity and depth.

After I realized – not exclusively, but also – through the success of the first article that there is a need for this sort of stuff, I began to work on the second part. As stated at the end of the previous installment, I wanted to create a demo web scraping application to show some advanced concepts. However, I left out a major coefficient from my future-plan equation: the power of Ruby.

Basically this web scraping code was my first serious Ruby program: I had come to know Ruby just a few weeks earlier, and I decided to try it out on some real-life problem. After hacking on this app for a few weeks, suddenly a reusable web scraping toolkit – scRUBYt! – began to materialize, which caused a total change of plan: instead of writing a follow-up, I decided to finish the toolkit, sketch the big picture of the topic, place scRUBYt! inside this frame and use it to illustrate the theoretical concepts described here.

The Big Picture: Web Information Acquisition

The whole art of systematically getting information from the Web is called ‘Web information acquisition’ in the literature. The process consists of 4 parts (see the illustration), which are executed in this order: Information Retrieval (IR), Information Extraction (IE), Information Integration (II) and Information Delivery (ID).

Information Retrieval

Navigate to and download the input documents which are the subject of the next steps. This is probably the most
intuitive step to make – clearly, the information acquisition system has to be pointed to the document which contains the data first, before it can perform the actual extraction.

The absolute majority of the information on the Web resides in the so-called deep web – backend databases and different legacy data stores whose contents are not available in static web documents. This data is accessible only via interaction with web pages (which serve as a frontend to these databases) – by filling in and submitting forms, clicking links, stepping through wizards etc. A typical example is an airport web page: an airport has all the schedules of the flights it offers in its databases, yet you can access this information only on the fly, by submitting a form containing your concrete request.

The opposite of the deep web is the surface web – static pages with a ‘constant’ URL, like the very page you are reading. In such a case, the information retrieval step consists of just downloading the URL. Not a really tough task.
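
In Ruby terms, the whole surface-web case boils down to something like the following sketch (plain standard library, no scRUBYt! involved; the URL is just an arbitrary example):

  # Surface web retrieval: just download the document sitting at a fixed URL.
  # open-uri ships with Ruby's standard library.
  require 'open-uri'

  html = open('http://rubyrailways.com/').read
  puts "Downloaded #{html.length} bytes"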

However, as I said above, most of the information is stored in the deep web – different actions, like filling in input fields, setting checkboxes and radio buttons, clicking links etc. are needed to get to the actual page of interest, which can then be downloaded as the result of the navigation.

Besides the fact that this is not trivial to do automatically from a programming language simply because of the nature of the task, there are a lot of pitfalls along the way, stemming from the fact that the HTTP protocol is stateless: the information provided with one request is lost by the time the next request is made. To remedy this problem, sessions, cookies, authorization, navigation history and other mechanisms were introduced – so a decent information retrieval module has to take care of these as well.

Fortunately, there are Ruby packages offering exactly this functionality. Probably the best known is WWW::Mechanize, which is able to automatically navigate through Web pages as a result of interaction (filling in forms etc.) while keeping cookies, automatically following redirects and simulating everything else a real user (or the browser on their behalf) would do. Mechanize is awesome – from my perspective it has one major flaw: you cannot interact with JavaScript websites. Hopefully this feature will be added soon.

Until that happy day, if someone wants to navigate through JS-powered pages, there is a solution: (Fire)Watir. Watir is capable of doing similar things to Mechanize (I never did a head-to-head comparison, though it would be interesting) with the added benefit of JavaScript handling.
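
To give you a feel for the deep-web case, here is a minimal sketch of driving a search form with WWW::Mechanize directly (my own example, not scRUBYt! code; the Google search form and its ‘q’ field are just an arbitrary, well-known target):

  require 'rubygems'
  require 'mechanize'

  agent = WWW::Mechanize.new
  page  = agent.get('http://www.google.com/')   # fetch the frontend page
  form  = page.forms.first                      # pick the search form
  form['q'] = 'ruby screen scraping'            # fill in the text field
  results = agent.submit(form)                  # cookies, redirects etc. are handled for us
  puts results.title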

scRUBYt! comes with a navigation module which is built upon Mechanize. In future releases I am planning to add FireWatir, too (precisely because of the JavaScript issue). scRUBYt! is basically a DSL for web scraping, with a lot of heavy lifting going on behind the scenes. Though the real power lies in the extraction module, there are some goodies in the navigation module, too. Let’s see an example!

Goal: Go to amazon.com. Type ‘Ruby’ into the search text field. To narrow down the results, click ‘Books’, then for further narrowing ‘Computers & Internet’ in the left sidebar.

Realization:

  fetch           'http://www.amazon.com/'
  fill_textfield  'field-keywords', 'ruby'
  submit
  click_link      'Books'
  click_link      'Computers & Internet'

Result: This document.

As you can see, scRUBYt’s DSL hides all the implementation details, making the description of the navigation as easy as possible. The result of the above few lines is a document – which is automatically fed into the scraping module, but this is already the topic of the next section.

Information Extraction

I think there is no need to explain why one would want to extract information from the Web today – the ‘how’ is a much more interesting question.

Why is Web extraction such a tedious task? Because the data of interest is stored in HTML documents (after navigating to them, that is), mixed with other stuff like formatting elements, scripts or comments. Because the data lacks any semantic description, a machine has no idea what a web shop record is or what a news article might look like – it just perceives the whole document as a soup of tags and text.

Querying objects in systems which are formally defined, and thus understandable for a machine, is easy. For instance, if I want to get the first element of an array in Ruby, I can do it like this:

my_array.first

Another example for a machine-queryable structure could be an SQL table: to pull out the elements matching the given criteria, all that needs to be done is to execute an SQL query like this:

SELECT name FROM students WHERE age > 25

Now, try to do a similar query against a Web page. For example, suppose that you have already navigated to an eBay page by searching for the term ‘Notebook’. Say you would like to execute the following query: ‘give me all the records with a price lower than $400’ (and get the results into a data structure of course – not rendered inside your browser, since that works naturally without any problems).

The query was definitely an easy one, yet without implementing a custom script that extracts the needed information and saves it to a data structure (or using stuff like scRUBYt!, which does exactly this for you) you have no chance of getting this information out of the source code.

There are ongoing efforts to change this situation – most notably the semantic Web, common ontologies, different Web2.0 technologies like taxonomies, folksonomies, microformats or tagging. The goal of these techniques is to make the documents understandable for machines to eliminate the problems stated above. While there are some promising results in this area already, there is a long way to go until the whole Web will be such a friendly place – my guess is that this will happen around Web88.0 in the optimistic case.

However, at the moment we are only at version 2.0 (at most), so if we would like to scrape a web page for whatever reason *today*, we need to cope with the difficulties we are facing. I wrote an overview on how to do this with the tools available in Ruby (update: there is a new kid on the block – HPricot – which is not mentioned there).

The rough idea behind those packages is to parse the Web page source into some meaningful structure (usually a tree), then provide a querying mechanism (like XPaths, CSS selectors or some other tree navigation model). You could think now: ‘A-ha! So a web page actually *can* be turned into something meaningful for machines, and there *is* a formal model to query this structure – so where is the problem described in the previous paragraphs? You just write queries like you would in the case of a database, evaluate them against the tree or whatever, and you are done’.
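
As a quick illustration of this ‘parse into a tree, then query it’ idea, here is a minimal sketch using Hpricot (mentioned above); the HTML snippet and the selectors are made up purely for the sake of the example:

  require 'rubygems'
  require 'hpricot'

  html = <<-HTML
    <table>
      <tr><td class="name">Canon IXUS</td><td class="price">$399</td></tr>
      <tr><td class="name">Nikon D40</td><td class="price">$550</td></tr>
    </table>
  HTML

  doc = Hpricot(html)
  (doc / 'tr').each do |row|                  # query the parsed tree with CSS selectors
    name  = row.at('td.name').inner_text
    price = row.at('td.price').inner_text
    puts "#{name}: #{price}"
  end

Even this tiny example shows the catch, though: somebody still has to sit down and figure out the right selectors for every concrete page.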

The problem is that the machine’s understanding of the page and the human way of thinking about querying its information are entirely different, and there is no formal model (yet) to eliminate this discrepancy. Humans want to scrape ‘web shop records with Canon cameras with a maximal price of $1000’, while the machine sees this as ‘the third <td> tag inside the eighth <tr> tag inside the fifth <table> … (lots of other tags) inside the <body> tag inside the <html> tag, where the text of the seventh <td> tag contains the string Canon and the text of the ninth <td> is not bigger than 1000’ (and to even get the value 1000 you have to use a regular expression or something to strip the most probably present currency symbol and other possible additional information).

So why is this so easy with a database? Because the data stored there has a formal model (specified by the CREATE TABLE statement). Both you and the computer know *exactly* what a Student or a Camera looks like, and both of you speak the same language (most probably an SQL dialect).

This is totally different in the case of a Web page. A web shop record, a camera detail page or a news item can look like just about anything, and your only chance to find out for the concrete Web page of interest is to exploit its structure. This is a very tedious task on its own (as I said earlier, a Web page is a mess of real data, formatting, scripts, stylesheet information…). Moreover, there are further problems: for example, a web shop record need not even be uniform within the same page – certain records can be missing cells that others have, or may keep some of the information on a detail page while others do not, and vice versa – so in some cases identifying a data model is impossible or very complicated. And I have not even talked about actually scraping the records yet!

So what could be the solution?

Intuitively, there is a need for an interpreter which understands the human query and translates it to XPaths (or whatever querying mechanism a machine understands). This is more or less what scRUBYt! does. Let me explain how – it will be easiest through a concrete example.

Suppose you would like to monitor stock information on finance.yahoo.com! This is how I would do it with scRUBYt!:

  #Navigate to the page
  fetch 'http://finance.yahoo.com/'

  #Grab the data!
  stockinfo do
    symbol 'Dow'
    value  '31.16'
  end

output:

  <root>
    <stockinfo>
      <symbol>Dow</symbol>
      <value>31.16</value>
    </stockinfo>
    <stockinfo>
      <symbol>Nasdaq</symbol>
      <value>4.95</value>
    </stockinfo>
    <stockinfo>
      <symbol>S&P 500</symbol>
      <value>2.89</value>
    </stockinfo>
    <stockinfo>
      <symbol>10-Yr Bond</symbol>
      <value>0.0100</value>
    </stockinfo>
  </root>

Explanation: I think the navigation step does not require any further explanation – we fetched the page of interest and fed it into the scraping module.

The scraping part is more interesting at the moment. Two things happened here: we have defined a hierarchical structure of the output data (like we would define an object – we are scraping StockInfos which have Symbol and Value fields, or children), and showed scRUBYt! what to look for on the page in order to fill the defined structure with relevant data.

How did I know I had to specify ‘Dow’ and ‘31.16’ to get these nice results? Well, by manually pointing my browser to http://finance.yahoo.com/, observing an example of the stuff I wanted to scrape – and leaving the rest to scRUBYt!. What actually happens under the hood is that scRUBYt! finds the XPath of these examples, figures out how to extract similar ones and arranges the data nicely into a result XML (well, there is much more going on, but this is the rough idea). If anyone is interested, I can explain this in a further post.

You could think now ‘O.K., this is very nice and all, but you have been talking about *monitoring* and I don’t really see how – the value 31.16 will change sooner or later and then you have to go to the page and re-specify the example again – I would not call this monitoring’.

Great observation. It’s true that scRUBYt! would not be of much use if the case of changing examples were not handled (unless you would like to get the data only once, that is) – fortunately, this situation is dealt with in a powerful way!

Once you run the extractor and you think the data it scrapes is correct, you can export it. Let’s see what the exported finance.yahoo.com extractor looks like:

  #Navigate to the page
  fetch 'http://finance.yahoo.com/'

  #Construct the wrapper
  stockinfo "/html/body/div/div/div/div/div/div/table/tbody/tr" do
    symbol "/td[1]/a[1]"
    value  "/td[3]/span[1]/b[1]"
  end

As you can see, there are no concrete examples any more – the system has generalized the information, and now you can use this extractor to scrape the data automatically whenever you like – until the moment the guys at Yahoo change the structure of the page, which fortunately does not happen every other day. In that case the extractor has to be regenerated with up-to-date examples (in the future I am planning to add automatic regeneration for such cases) and the fun can begin from the start once again.

This example just scratched the surface of what scRUBYt! is capable of – there is a ton of advanced stuff to fine-tune the scraping process and get the data you need. If you are interested, check out http://scrubyt.org for more information!

Conclusion

The first two steps of information acquisition (retrieval and extraction) deal with the question ‘How do I get the data I am interested in?’ (querying). As of the present version (0.2.0), scRUBYt! covers just these two steps – and even to do these properly, I will need a lot of testing, feedback, bug fixing, stabilization and heaps of new features and enhancements – because, as you have seen, web scraping is not a straightforward thing to do at all.

The last two steps (integration and delivery) address the question ‘What do I do with the data once it is collected, and how?’ (orchestration). These facets will be covered in a future installment – most probably once scRUBYt! contains these features as well.

If you liked this article and you are interested in web scraping in practice, be sure to install scRUBYt! and check out the community page for further instructions – the site is just taking off, so there is not too much there yet, but hopefully enough to get you started. I am counting on your feedback, suggestions, bug reports, extractors you have created etc. to enhance both scrubyt.org and the scRUBYt! user experience in general. Be sure to share your experience and opinion!

Web 2.0 Tutorial

First of all, I have to make a disappointing confession: this is not a Web 2.0 tutorial – but fear not, at least an answer to the logical and absolutely valid question this raises (i.e. why the hell is the article entitled ‘Web 2.0 tutorial’ then?) will be provided.

Although this blog’s tagline is ‘Ruby, Rails, Web2.0’ and I am blogging (or planning to blog) about all of these topics, I have not had a single exclusively-and-only-about-Web2.0 post yet (as far as I remember). That’s why it strikes me as odd that, according to Google Analytics, a lot of people are finding this site via the keyword combination ‘Web2.0 tutorial’. This post was inspired by them and written for them!

Since this trend is nearly as old as this blog – and it seems to continue, and even grow as time goes by – I am now really curious what the heck people imagine behind the term ‘Web2.0 tutorial’. Why? Well, there are several reasons to ponder this:

- Nobody knows what Web 2.0 actually is (or if somebody does, the others don’t agree :-) ). Since being coined by Tim O’Reilly back in 2005, ‘Web 2.0’ has been redefined, argued about, glorified, despised, parodied, upgraded to Web 3.0, regarded as vapor, a bubble etc. (and who knows what else…) countless times – just one thing did not happen: a commonly accepted, concise (or even lengthy) definition with which everybody would agree.
You won’t find anybody interested in the Web today who does not have his own definition of Web2.0 – however, these definitions (although more overlapping and similar than ever) vary from person to person.

- The phrase itself is kind of absurd – even if we accept that there is a common understanding of the term ‘Web2.0’, it definitely has many facets: look (Apple aqua reinvented, round corners galore, reflections of reflections etc), social aspect (digg, del.icio.us, youTube, myspace et al), theoretical backend (ontologies, folksonomies, openAPIs, microformats, mashups etc), standards (XHTML (2.0! :-) ), RDF, FOAF, ATOM, SVG, SOAP), innovative ways of communicating with and catering to the users (WS, REST, podcasts, videocasts), typical Web2.0-purpose pages (wikis, blogs), development tools and frameworks (AJAX, Ruby on Rails, …) and other buzzwords :-)

- Even if we define Web2.0 as a collection of the things from the previous point, the term ‘Web 2.0 tutorial’ is too broad to get you many relevant results (I believe – maybe some smart webmasters versed in the ways of SEO trickery have already noticed the craving for a Web2.0 tutorial and written up a few for you :-) ). Just as you would not search for a ‘programming language tutorial’ (but a ‘Ruby tutorial’ instead) or a ‘sport tutorial’ (rather a ‘squash tutorial’), searching for a *real* ‘Web2.0 tutorial’ could be ineffective, too. I suggest looking for ‘rounded corners tutorial’, ‘mashup tutorial’ or ‘Ruby on Rails tutorial’ etc. instead. Additionally, if you are really keen on the Web2.0-ness of these documents, don’t forget to add ‘Web2.0’ to the query – just in case :-) .

- Related to the previous point: attack the problem from the bottom up rather than the other way around – i.e. look for solutions to concrete problems and assemble them into a Web2.0-style whatever once you are done, rather than trying to do something that is Web2.0 in the first place. In my opinion you should think ‘I would like to create a great mashup in Ruby on Rails with AJAX and a Web2.0 look – how should I go about this?’ rather than ‘Let’s see a good Web 2.0 tutorial and then I will cook up something great’. You should strive to create great-looking websites with great content and functionality, and people will like them and use them – whether you call it Web2.0, Web3.0 or whatever – even if the URL of the site is www.thissiteisnotweb2.0.com :-) .

Now that I have mentioned ‘Web2.0’ and ‘Web 2.0 tutorial’ so many times in this article, I guess I’ll be receiving even more hits through this query – though that was definitely not the reason for writing it. However, if you have already got this far, please take a few seconds and share your thoughts on this. After all, Web2.0 is also about collaboration, you know. Heck, I might even write a few Web2.0 tutorials in the future – just tell me what a ‘Web2.0 tutorial’ means… :-) .

2006 rubyrailways.com Retrospective

For the sake of future comparison, out of plain fun and for whatever other reason, here are some statistics from my first roughly half a year of blogging:

Global Statistics

  1. 1,057,638 successful requests for an average of approximately 4000 requests/day
  2. 622,776 page views for an average of approximately 2300 page views/day
  3. 34 posts and 364 comments, contained within 15 categories. This statistically means a post gets about 11 comments on average
  4. Data transferred: 10.54 GB, which is a daily average of approximately 40 MB
  5. Current AdSense CPM: 2.04$ (is this good or bad? It is hard to get such info on the net…)

Content

  1. Most popular post (i.e. most page hits): Data extraction for Web 2.0: Screen scraping in Ruby/Rails (nearly 10,000 reads)
  2. Most debated/controversial post (i.e. most comments): Sometimes less is more (45 comments)
  3. Most referenced article: Install Internet Explorer on Ubuntu Dapper in 3 easy steps (9 references)
  4. Best runner-up: Implementing ‘15 Exercises for Learning a new Programming Language’

Platforms

  1. 57% Windows (quite surprising for a site where the most popular search terms were ‘ubuntu ruby rails’ and ‘dapper ruby install’ :-) )
  2. 27% Linux
  3. 16% Mac

Browsers

  1. 74% Firefox & Mozilla
  2. 14% Internet Explorer (83% IE 6.0, 16% IE 7.0)
  3. 7% Safari
  4. 3% Opera

Top 5 referring sources

  1. google.com
  2. direct
  3. stumbleupon.com
  4. dzone.com
  5. del.icio.us

Given that rubyrailways.com is my first attempt at blogging, that I have been studying Ruby for just a few months now (I even started this blog before I wrote my first Ruby script), that I have really little time for blogging and that I am not a native speaker, these figures are not that bad, I guess :-) . Of course I would like to improve them even more, so please leave a comment with suggestions – what would you like to see here in 2007?

Book Review: Ruby Cookbook

Since I am relatively new to Ruby, I have no idea what life could have been like in the dark ages of the non-Japanese-speaking Ruby community (1995 – 2000), when there was no English Ruby book on the market. The ice was broken by Andy Hunt and Dave Thomas with a pickaxe – err… actually the Pickaxe (a.k.a. “Programming Ruby”), which has undoubtedly become an all-time Ruby classic since then.

In the foreword, Matz, the author of Ruby, explains that since he is much better at coding than at writing documentation, the authors probably did not have an easy job – whatever they could not find in the (rather scant) documentation, they had to figure out directly from the Ruby source code.

The Ruby book scene looks radically different today. In fact, we are facing the opposite problem: there are so many books on Ruby that sometimes it can be hard to choose which ones to read and in which order. It probably won’t be any easier to answer these questions in the future: judging from the blogs and announcements, the bigger part of the books is yet to come. If you are new to Ruby you will most probably have a hard time figuring out how to spend your money wisely [1] – so what’s the solution?

Of course there is no definitive answer for this question – I can only tell you what worked for me.

First, I would definitely recommend David A. Black’s Ruby for Rails [2]. It is absolutely suited for newcomers (and for advanced hackers, too), no matter whether you are new to Ruby and/or coming from a different programming language [3]. I was a Python enthusiast myself (though doing most of my everyday work in Java) when I discovered Ruby – and David’s book was a perfect choice for switching very fast.

Currently I am undecided between the 2nd and the 3rd place, so let’s say you should check them out in parallel – they are (of course) the Pickaxe and Hal Fulton’s “The Ruby Way”. Both are time-tested Ruby classics, hence a must-read. However, if you have the time and/or money to read only one of the above books, in my opinion it should be “Ruby for Rails”.

Although these three masterpieces are – in my opinion – among the best-written and most informative tech books available today, you have to remember the good old rule: no matter how many books you read or how good they are, you will never become a true Ruby hacker until you actually begin to use the acquired knowledge and put it into practice.

After reading these books I wanted to jump into writing some cool stuff – Ruby seemed to be so elegant, easy and succinct – and to my greatest surprise, I could not write much sensible code :-) (at least not without referring to these books and/or Google and/or ruby-talk more frequently than I considered OK to still call it programming on my own).

This is exactly the situation where the Ruby Cookbook enters the scene. The first three books give you a hint about *what* can be done with Ruby [4]. The Cookbook offers well-organized content in the form of recipes to show you *how* it can be done elegantly, quickly and effectively, in a ruby-esque way.

Probably the most frequent answer on the ruby-talk mailing list to the question ‘How should I improve my Ruby skills?’ is: by starting your own project. Since I put this advice into practice myself and it worked for me, I have to agree: armed with the goodies from Ruby for Rails, the Pickaxe and The Ruby Way, the best thing to do is to grab a copy of the Cookbook and jump into your own project. When I started mine, a web extraction framework, I had no idea about documenting Ruby code, packaging the whole program into a gem, logging, writing unit tests (in Ruby) or automating these tasks (and a lot of other things – this post would be considerably longer if I tried to list everything). However, with the Ruby Cookbook by my side, learning these things and putting them into practice, from writing the first line until packaging the whole framework into a gem, was a piece of cake.

If you are unfamiliar with the O’Reilly cookbook series format, it is a set of ‘recipes’ (problem statement, solution, discussion) divided into categories (like Strings, Arrays, Hashes… in this case) for easy lookup of the problem at hand. While it would be possible and certainly edifying to read the book cover to cover from the start (in this case you should also consider that it has 873 pages), I found that it really shines when you are stuck with a problem: you search for the relevant category and the relevant problem, apply the solution, read the discussion to understand what’s going on under the hood, rinse, repeat and after the 3rd or so cycle you will find out that you are not reaching for the book anymore (at least not because of this problem).

OK, time to take a more detailed look at the content.

I would divide the book into five categories: Essentials, Ruby Specific Constructs, Advanced Techniques, Internet and networking and Software Management/Distribution. I will review them one by one briefly.

  • Essentials include Strings, Numbers, Arrays, Hashes, Date and Time, and Files and Directories. For a beginner Ruby journeyman, these chapters are a real gold mine. Though the Cookbook is not really intended for total beginners (it assumes a fair amount of Ruby knowledge), it certainly would not be impossible for a skilled (non-Ruby) programmer to understand most of the recipes, since they go from simple to complicated (e.g. the String chapter begins with concatenating strings and closes with showing off text classification with a Bayesian classifier).

    In this category I have probably learned the most Ruby best practices from the Arrays and Hashes chapters [5]. As a constant lurker on the ruby-talk mailing list, I had a hard time figuring out all those inject()s and collect()s and each_slice()s and each_cons()s and other enumerator/iterator things – whenever I thought I had finally understood them, somebody came up with an even more complicated example and I was not so sure once again – until the moment I bought the book, that is.

    The Cookbook is very good at eliminating vague and wobbly spots like the ones I had: you will not only understand what’s going on, but actually get comfortable using the idioms so typical of Ruby. That’s what is so great about it.

  • Ruby Specific Constructs featuring Objects and Classes, Modules and Namespaces, and Reflection and Metaprogramming. Every newcomer to Ruby encounters the wonders that (not exclusively but most characteristically) make the language so beautiful: code blocks, closures, mixins, the vast possibilities offered by metaprogramming and reflection just to mention some of them. This chapter is written exactly to examine and discuss these constructs.

    While I probably learned the most new things from this section, I have to say that I missed a meta-level here: the chapters (especially the one about metaprogramming) presented a lot of fancy LEGO bricks but did not show how to build a Statue of Liberty or an Eiffel Tower out of them (well, not even a simple medieval castle, in my opinion :-) ). Of course this does not have to be a problem – metaprogramming techniques deserve a book of their own, and anyway a cookbook is not intended to solve concrete problems but rather recurring/frequent ones. Probably I am just too curious about the ways of the meta :-) .

    To sum it up, this section and the previous one (Essentials) together beefed up my rubyish programming style by an enormous amount in practice – nearly all the information you need is there in the other books as well, but reading them does not make you comfortable with these techniques.

  • Advanced Techniques include XML and HTML, Graphics and Other File Formats, Databases and Persistence, Multitasking and Multithreading, User Interface, Extending Ruby with Other Languages, and System Administration. I was kind of unsure about this category – pairing UI with databases or system administration, for example, seemed odd at first glance – but since I did not want to create even more categories, I decided to put everything here which did not fit into the other ones, so it can also be viewed as a ‘miscellaneous’ section.

    I would like to review two chapters here – XML/HTML and Databases and Persistence – since these are the closest to my field of expertise and I also believe these two were the deepest in this category. Again, this does not mean that the other chapters are not good, but in my opinion they just scratch the surface compared to the above two.

    The XML/HTML chapter really has it all: parsing, validating, transforming, extracting data from XML documents, encoding and XPath handling, to highlight some interesting topics. The coverage is surprisingly thorough for a language which promotes YAML (YAML Ain’t Markup Language) over XML. The HTML recipes, though there are just a few of them, are also very useful: downloading content from Web pages, extracting data from HTML, converting plain text to HTML and vice versa. My only concern here is that I missed some third-party package coverage (like RedCloth, BlueCloth, Hpricot or Mechanize) – but this is really nitpicking: if the authors took all my wishes into account, the book would have several thousand pages :-)

    Databases and Persistence starts off with serialization recipes (using YAML, Marshal and Madeleine). Chapters on indexing unstructured as well as structured text (SimpleSearch, Ferret) are a pleasant surprise before the must-have topics take off: connecting to and using different kinds of databases (MySQL, PostgreSQL, Berkeley DB) as well as Object Relational Mapping frameworks (Rails ActiveRecord and Nitro Og), and doing every kind of SQL voodoo magic of course. What should I add? Probably nothing. [6]

    I would really like to write something about the other chapters in this category, too, but since I am constantly bashed for the length of my posts, just believe me that they are great as well :-) .

  • Internet and networking consists of Web Services and Distributed Programming, Internet Services and (surprise! surprise!) Web Development: Ruby on Rails. It would really be a cliché to write about why and how much the Internet matters nowadays, how much Web 2.0 rocks, how SOA and WS and REST and FOO and BAR rule etc., so I won’t do that ;-) . However, it is a fact that Web application development has never mattered as much as it does today – so these chapters were basically compulsory.

    I would divide the category into two subcategories – Internet/Web stuff and distributed programming.

    There is really not too much to add to the first category – there is an unbelievable amount of information crammed into two chapters: ‘abstract’ techniques (HTTP headers and requests, DNS lookup etc), using every kind of protocol (HTTP(S), POP, IMAP, FTP, telnet, SSH…), servlet, client/server and CGI programming, as well as talking to Web APIs (Amazon, Flickr, Google) and Web services of course (XML-RPC, SOAP). In my opinion, the category offers more than enough information to get started and/or explore advanced techniques.

    It’s a shame that Distributed Programming got only half a chapter – OK, I admit I am somewhat partial to these techniques and they are maybe not used by that many people. The action revolves mostly around DRb and Rinda, with the exception of two memcached recipes. The chapter closes with a nice ‘putting it all together’ recipe: creating a remote-controlled jukebox.

    I did not get too deep into the Ruby on Rails chapter, since I had previously read Agile Web Development with Rails as well as Ruby for Rails and a lot of more advanced Rails material – but judging from the recipe titles and skimming through some of them, the chapter looks very informative and unquestionably helpful if you have had no prior experience with Rails.

  • Last but not least, Managing and Distributing Software includes Testing, Debugging, Optimizing, and Documenting; Packaging and Distributing Software; and Automating Tasks with Rake. If you plan to use Ruby for anything other than system administration (or writing very short scripts/one-liners for whatever reason), documenting, testing, debugging and automating tasks are absolutely crucial. I know that a lot of coders do not like to hear this – they want to code, not write tests, documentation etc. – but I think that nowadays a serious programmer, no matter how much she would like to concentrate on hacking up feature MyNextCoolStuffWhichWillShakeTheEarth, has to master these things. In the long run, any software that is undocumented, untested and never refactored will turn into spaghetti quite easily.

    That said, these chapters were excellent for me. I have experience with these tasks in Java – however, the toolset is radically different in some cases (like Ant vs. Rake), and even where it is similar (unit tests, RDoc vs. JavaDoc), re-learning it was inevitable. Fortunately, with the help of these recipes it was a breeze to pick them up in Ruby (well, I have to add that these things (like nearly everything else) are considerably easier to do in Ruby, so the ease of learning stems from that fact as well).

    Rake absolutely rocks. Maybe I am also biased because I have been working with Apache Ant a lot – well, if the ratio between Ruby and Java code is, say, 1:10, then the ratio between Rake and Ant files is 1:50 if we also consider simplicity, maintainability and understandability.

    Finally, if you also plan to release your software, the chapter Packaging and Distributing Software can come in handy. I think if you would like to distribute your stuff to the masses, packaging it into a gem is inevitable – rubygems are so cool that they have made Rubyists too lazy to download anything from a site instead of just launching ‘gem install my_cool_software’.

Conclusion



If you would like to become a serious Ruby hacker, don’t hesitate to buy this book. In my opinion it is absolutely worth every cent – and even more. My only problem is that there are no more recipes – however, this is not a criticism but rather a compliment: you simply cannot get enough, not even from nearly 900 pages. One could argue that some things are missing, or that he would rather see this instead of that (I believe the authors themselves had a tough time deciding these matters) – but I guess everyone agrees that the material which made it into the book is absolutely top-notch. 5 out of 5 stars – a great addition to anyone’s Ruby bookshelf.

Notes


[1] It is absolutely possible to learn Ruby without spending a nickel – there are excellent Ruby tutorials out there, like Why’s poignant guide to Ruby (with cartoon foxes and chunky bacon :-) ), the first edition of the Pickaxe book which is available online for free, or Learning Ruby by Satish Talim, and a lot of others, too. For some beginner Ruby exercises you can also check out my earlier post: 15 exercises for learning a new programming language – or just use Google… Back


[2] I am not sure whether it was the best move to include ‘Rails’ in the title – it may turn away some readers who would like to learn Ruby but not Rails. However, I can assure you that this book is a true Ruby masterpiece. Though there are some interesting Rails techniques included, the primary focus is unquestionably Ruby. Back


[3] There is one possible exception: If you are new not only to Ruby but also to programming, you should probably check out Chris Pine’s Learning to program first. Back


[4] Of course there will always be some overlap, and not every book can be categorized perfectly in every case (for example, The Ruby Way also has some cookbook-like chapters). Back


[5] Of course this does not mean that the rest of the chapters were not helpful – it’s just that, coming from Python, I did not have as many ‘wow’ moments there. Nevertheless, they also teach a lot of idioms and are in no way less informative than the other two. Back


[6] Devil’s advocate(tm) says: maybe some chapters on SQLite and Oracle, as well as advanced SQL stuff would be cool – however, this is really mega-über nitpicking since then the title should be ‘Ruby and SQL cookbook’ :-) Back