Partitioning Sets in Ruby

During hacking on various tasks, I needed to partition a set of elements quite a few times. I have attacked the problem with different homegrown implementations, mostly involving select-ing every element belonging into the same basket in turn. Fortunately I run across divide recently, which does exactly this… No more wheel reinvention! Let’s see a concrete example.

I have an input file like this:

a 53 2 3
b 8 62 1 23
a 9 0 31
b 4 45 4 16 7
b 1 23
c 3 42 2 31 4 6
a 1 3 22
a 7 83 1 23 3
b 1 14 4 15 16 2
c 5 16 2 34

The goal is to sum up all the numbers in rows beginning with the same character (e.g. to sum up all the numbers that are in a row beginning with ‘a’). The result should look like:

[{"a"=>241}, {"b"=>246}, {"c"=>145}]

This is an ideal task for divide! Let’s see one possible solution for the problem:

require 'set'

input = Set.new open('input.txt').readlines.map{|e| e.chomp}
groups = input.divide {|x,y| x.map[0][0] == y.map[0][0] }
#build the array of hashes
p groups.map.inject([]) {|a,g|
   #build the hashes for the number sequences with same letters
    a << g.map.inject(Hash.new(0)) {|h,v|
    #for every sequence, sum the numbers it contains
    h[v[0..0]] += v[2..-1].split(' ').inject(0) {|c,x|
      c+=x.to_i; c}; h
  }; a
}

The output is:

[{"a"=>241}, {"b"=>246}, {"c"=>145}]

Great - it works! Now let's take a look into the code...

The 3rd line loads the lines into a set like this:




The real thing happens on line 4. After it's execution, groups looks like:


, , }>

As you can see, the set is correctly partitioned now - with almost no effort! We did not even need to require an external library...

The rest of the code is out of the scope of this article (everybody is always complaining about the long articles here, so I am trying to keep them short) - and anyway, the remaining snippet is just a bunch of calls to inject. If inject does not feel too natural to you, don't worry - it took me months until I got used to it, and some people (despite of the fact that they fully understand and are able to use it) never reach after it - I guess it's a matter of taste...'

Needle in the Haystack – Information Overloading 2.0

Do you also have the feeling that you are totally drowning under the unbelievable amount of information that is emited by the Web today? (and by other media as well, which emphasizes this greatly, but I would like to focus solely on the Web aspect in this article). I feel more and more frustrated day by day, trying to stay on top of my ever-growing heap of unopened e-mails, undread blog entries, unchecked news sites etc. with a constant fear that though I spend a fair amount of time to consume and process all the information pouring in, I am still missing something very important all the time.

The “problem” is that there are way too many outstanding blogs, aggregators, social new sites, bookmarking service popular links and other sources of information which you “just can not miss”. I fear I am definitely losing the battle – there are more and more information sources, but no new, more effective methods (at least I don’t know about them) to handle them, so I guess it’s pretty clear that as time is progressing, more and more info will fall through the cracks (or spending more and more time will be needed to prevent this).

Since there is no way to stop the exponential growth of information (and if there would be, I doubt anybody would want to utilize it – this is just not the way this problem should be approached), we have to influence the other factor: find more effective means of locating, sorting, bookmarking, processing and otherwise handling the most important data.

It is interesting to observe that at the moment, services with this intention are not really receiving as much attention as they should – provided that the above reasoning is sound and thus there is a need for more effective handling of existing information. Google is a trivial example of this: it has loads of interesting tricks to refine, specify and narrow your search (like for example the synonym operator, ~, or other advanced options) – yet I bet 2 hours of my most precious blog-reading time that most of us can not even tell when did we use advanced search for the last time (besides a few trivial ones entered to the search box, like site:.rubyrailways.com). In most of the cases I never check out more than 2-3 result pages (and just the first page in 90% of the time) – which is interesting, given that I am doing 99% of my searches on google!

In my opinion, exactly the opposite is true: Sites like twitter or tumblelog are immensely popular, flooding you with even more and more information, all the time, every minute, as fast as possible etc. You did not have enough time to read blogs? No problem, here are tumblelogs and twitter messages, which will help you by shooting even more data right into your face much more frequently than ever. Welcome to information overloading 2.0.

Fortunately there is hope on the horizon: some sites are striving to help the situation by providing interesting view on the data, narrowing down the information to a specific niche, or aggregating and presenting it in a way so that you do not have hand-pick it from an enormous set of everything-in-one-bag infosoup. I will try to describe a few of them which I have found interesting recently.

  • Tools utilizing visual representation of data – People are visual beings. In most of the cases, a few good, to-the-point pictures or diagrams can tell much more than boring pages of text. Therefore it is quite intuitive that visual representation of data (typically result of search engine queries) could help to navigate, refine and finally locate relevant results compared to text-only pages.

    ubuntu
    My current favorite in this category is quintura. Besides working as a normal yahoo search, quintura does a lot of other interesting things: it finds related tags to your query and displays them as a tag cloud. You can further refine the search results or navigate to documents found by any of the related tags. Hovering over the related tags displays the related tag for that tag. For example, searching for web scraping, and hovering over the ‘ruby’ related tag, ‘scrubyt’ is also displayed – it would definitely take more time to find scrubyt on google, even by using the search term combination ‘web scraping ruby’ – so the functionality offers more than just a fancy view, it actually speeds up and makes searching faster and more effective.


    quitura in action

    Am I using quintura regularly? Nope. Given that I have just stated a few sentences ago that it can speed up and make searching faster and more effective’ this is strange – but for some reason, if I am after something, I am trying to find it on google.com. This is rather irrational, don’t you think so?

  • Sites concentrating on a specific niche – I feel that (otherwise great) sites like digg are just too overcrowded for me: with over 5000 submissions a day in a lot of diverse categories it simply takes too much time to read even just the front page stories. I am mainly interested in technology and development related articles – and while a lot of digg proponents are arguing that there are both technology and programming categories on digg, they are still too ‘mainstream’ for my taste and rarely catering to a ardcore developer/hacker in my opinion.

    Fortunately dzone and tektag are here to rescue the situation!

    ubuntu
    The guys over at dzone are really cranking all the time to bring a great digg-like site for developers that helps you to stay on top of the current development and technology trends. The community (which is crucial in the case of such a site of course) is really nice and helpful, and in my opinion the site owners have found (and are consantly fine-tuning) the right magic formula to keep the site from being overloaded with redundant information but still delivering the most relevant news and stuff. Currently, dzone is my no 1. source of developer and tech news on the web.

    ubuntu
    In my opinion, tektag did not reach the maturity level of dzone yet (I think they are currently in public beta), but once this will happen, I bet it would be a very important and relevant source of information for developers, too. To put it simple, tektag is to del.icio.us what dzone is to digg. Why is this so great? If you need to bookmark something, you should just use del.icio.us, right? Wrong – at least if you intend to use del.icio.us in any other way than store your personal bookmarks. The problem with del.icio.us again is that people are using to bookmark just anything with it – therefore it is virtually impossible to track the movers and shakers in a narrow topic (like programming). Visiting del.icio.us/popular will show you what’s being bookmarked the most overall, not inside your category of interest (of course I know there are categrories like del.icio.us/popular/programming, but these still do not solve the situation fully by far).

    Tektag has the potential to solve this situation by adding development-specific features and tweaks, but most importantly by the fact that only developer articles will be saved here and thus interpreting the data will me much more easy since the input won’t be cluttered with an enormous amount of information from arbitrary topics. In my opinion the only question of their succes is: can they build the critical user mass?

  • Semantic search – if you hear the word ‘search engine’ most probably google or one of it’s competitors (yahoo, msn) springs to your mind, and you are right – for the absolute majority of the searches, we are using these sites. However, they are not really that powerful in letting you express what are you searching for exactly (and of course, stemming from this fact, actually bring you the right results) because they are not trying to understand the documents on the Web: they just crawl and index them to be searchable with phrases they contain.

    Since the above mentioned sites are still the absolute market leaders in search, It’s clear the keyword based indexing is still good enough(tm) – until somebody will show that there is a more sophisticated way of searching, by trying to apply natural language processing, ontology extraction and other semantic techniques to to actually understand the documents, and deliver usable results with these techniques.

    ubuntu
    Spock, an upcoming people search engine is a great example of this principle in action. Spock’s goal is to crawl the whole web and extract information about people – which is far from trivial – since to do this properly, their spiders have to be smart enough to understand human language as much as possible (A simple example: think of a birth date, e.g. 07/06/05 – is 07 denoting a day (meaning the 7th day in the month) or a year (the year 2007)? There are hundreds, maybe thousands of date formats used on the Web – and there are far more complicated problems to solve than this).

    OK solving complex problems or not, what’s so cool about a people search engine? After all you can use ye good olde google as for everything else. Tim O’Reilly has an excellent example against this approach: on google, it’s trivial to find Eric Schmidt, google’s CEO – however it’s much harder to find the other 44 Eric Schmidts returned by spock. It’s not that google does not find them – but to actually locate them in as much as approximately 4,500,000 returned documents (as opposed to spock’s 45) is nearly impossible.
    Spock is probably the best example in this article to demonstrate how a service should bring you all the information you need – and not even a bit more!

If these services are so promising and they help you to figt the information overloading, thus helping you to find desired information easier (so that you will have more time to read other blogs :-) ), why they are less popular by magnitudes than the ones flooding you all the time? Why do not people use as simple things as advanced google search to fight information overloading? Is information overloading a bad thing at all (since it seems the sites generating the most information with the fastest pace are the most popular)? I can’t really answer these questions at the moment, but even if I could, I have to run now to read some interesting (tumble|b)logs. What!? 20 twitter messages received? Ok, seriously gotta go now…