Friday, September 16, 2011

Some random thoughts on Hadoop, HBase and OSS

Let me start with the link, since it's a very interesting piece of news: Google sorts a petabyte. First of all, I must confess: I am a Hadoop programmer. Hadoop is the open-source alternative to Google's proprietary MapReduce framework (which is what they used to sort the petabyte). This is my day-to-day job: doing stuff in Hadoop and HBase (a distributed key-value data store inspired by Google's BigTable).

So I am wondering: how long would it take to sort a petabyte in Hadoop? The last number I am aware of is 973 minutes on a 3600+ node cluster in 2009, which is roughly 30x slower. Of course, an average server from 2009 cannot be compared to an average server from 2011, and that cluster had less than half as many servers. How long would it take to finish the same benchmark on the same 8000-node cluster, but with Apache Hadoop instead of Google's proprietary MapReduce framework? As a rough estimate, I would divide 973 minutes by 4, which gives ~240 minutes (the short sketch at the end of this post makes the arithmetic explicit). That is still ~8 times slower than what Google can do.

So what? I must confess one more time: I do not believe in OSS (Open Source Software) as a good model for EVERY type of software. When you need:
  1. Performance.
  2. Optimal (minimal) system resource usage.
  3. Robustness.
  4. Predictable release schedule.
  5. Innovation.
You had better look for commercial alternatives, or develop the software in-house (if you have the budget, the time and skilled professionals).
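
For what it's worth, here is the back-of-envelope arithmetic from above as a tiny, runnable Java snippet. This is an estimate only: the ~4x combined speedup (2011 hardware plus roughly 2.2x more nodes) is my assumption, and the 33-minute figure is what was reported for Google's 2011 run.

    // Back-of-envelope sketch of the estimate above; the speedup
    // factor is an assumption, not a measurement.
    public class PetabyteSortEstimate {
        public static void main(String[] args) {
            double hadoop2009Min = 973.0; // Hadoop petabyte sort, 3600+ nodes (2009)
            double google2011Min = 33.0;  // Google's reported petabyte sort, 8000 nodes (2011)
            double assumedSpeedup = 4.0;  // guessed gain: newer hardware + ~2.2x more nodes

            double hadoop2011Min = hadoop2009Min / assumedSpeedup; // ~243 minutes
            System.out.printf("2009 gap: ~%.1fx slower%n",
                    hadoop2009Min / google2011Min);
            System.out.printf("2011 estimate: ~%.0f minutes, still ~%.1fx slower%n",
                    hadoop2011Min, hadoop2011Min / google2011Min);
        }
    }

(It prints ~29.5x and ~7.4x; I round those to ~30x and ~8x above.)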