Friday, September 16, 2011

Some random thoughts on Hadoop, HBase and OSS

Let me start with a link to a very interesting piece of information: Google sorts a petabyte. First of all, I must confess: I am a Hadoop programmer. Hadoop is the open-source alternative to Google's proprietary MapReduce framework (which is what they used to sort the petabyte). This is my day-to-day job: doing stuff in Hadoop and HBase (a distributed key-value data store inspired by Google's BigTable). I wonder how long it would take to sort a petabyte in Hadoop? The last number I am aware of: 973 minutes on a 3600+ node cluster in 2009. That is 30x slower. Of course, an average server in 2009 cannot be compared to an average server in 2011, and that cluster had less than half as many servers. How long would the same benchmark take on the same 8000-node cluster, but with Apache Hadoop instead of Google's proprietary MapReduce framework? As a rough estimate, I would divide 973 minutes by 4, which gives ~240 minutes. That is still ~8x slower than Google can do it. So what? I must confess one more time: I do not believe in OSS (Open Source Software) as a good model for EVERY type of software. When you need: 
  1. Performance.
  2. Optimal (minimal) system resource usage.
  3. Robustness.
  4. Predictable release schedule.
  5. Innovation.
you are better off looking for commercial alternatives or developing the software in-house (if you have the budget, time and skilled professionals).



 

Thursday, August 18, 2011

Update on Koda

  1. Fixed an issue in the native memory allocator used internally by Koda which greatly affected multithreaded update performance. Preliminary numbers are ~4M queries per second at a 90/10 read/write ratio (up 15%). For a 50/50 ratio the gain should be much larger.
  2. Implemented persistence layer support (based on the very promising leveldb library). It is not integrated yet (I will be working on that later next week).
  3. Currently working on compression support (snappy and gzip) for Koda.
My plans for early autumn 2011: Koda will be released as part of a new open-source distributed key-value data store. Wait for announcements. We will be in the same league as Cassandra and HBase.
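Snappy needs a third-party binding, but the gzip half of item 3 can be sketched with nothing but the JDK. This is my own minimal illustration of compressing serialized cache values before they go off-heap, not Koda's actual codec (the ValueCodec name is mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ValueCodec {
    // Compress a serialized cache value before it is stored.
    public static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        gz.write(raw);
        gz.close();
        return bos.toByteArray();
    }

    // Decompress the value on read.
    public static byte[] decompress(byte[] packed) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed));
        byte[] buf = new byte[4096];
        int n;
        while ((n = gz.read(buf)) > 0) bos.write(buf, 0, n);
        gz.close();
        return bos.toByteArray();
    }
}
```

The trade-off is CPU for RAM: for the 500-1000 byte text-like values used in my benchmarks, gzip typically shrinks them severalfold, while snappy trades some of that ratio for much higher throughput.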




Tuesday, August 2, 2011

How to become rich and very rich?

Befriend the US government. Read this passage carefully:

The long-term goal is for SAP to be seen as a provider of insight – cheap and fast, in near real time. Of course, that seems like the goal of every old-line hardware and software giant who is trying to move to the world of cloud computing. McDermott has his own proof points, including a job the company did for Recovery.gov that enables citizens to quickly identify which portions of U.S. geography received what portion of $800 billion in economic stimulus spending. “Traditional vendors said ‘We can do it in a couple of years, for a couple hundred million dollars,’” McDermott says. “We did it for single-digit millions.”

That is it. "Single-digit millions" for a web-site back-end that a group of oDesk developers could build in a couple of months with a budget well under $50K.

Wednesday, July 13, 2011

Koda: disposable means reusable

This is one feature I forgot to mention in my previous post: Koda supports disposable caches. What is a disposable cache? Imagine you need to create an in-memory cache holding several million entries (or billions, if you have enough RAM :)), process the entries, and then discard (dispose of) them. Now imagine you need to do this many times per second, in other words, very often. Releasing cache entries explicitly would hammer the native memory allocator and take real time (probably several million de-allocations per second). Instead of deallocating (releasing) cache entries explicitly, Koda puts them into a so-called reuse pool, where they can be reused (realloc'ed) later for new cache entries. This allows caches to be disposed of (and reused) almost instantly. A nice feature, by the way, and it has some very interesting applications.
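The reuse-pool idea can be sketched as a simple free list of off-heap buffers. This is a hypothetical illustration of the technique, not Koda's internals (the class and method names are mine):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a reuse pool: instead of freeing native buffers when a cache
// is disposed, park them on a free list so the next cache can claim them.
public class ReusePool {
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<ByteBuffer>();
    private final int slotSize;

    public ReusePool(int slotSize) { this.slotSize = slotSize; }

    // Allocate: reuse a parked buffer if one exists, otherwise go to the OS.
    public ByteBuffer allocate() {
        ByteBuffer b = free.poll();
        if (b == null) b = ByteBuffer.allocateDirect(slotSize);
        b.clear();
        return b;
    }

    // Dispose: O(1) per entry, no native free() call at all.
    public void release(ByteBuffer b) { free.offer(b); }

    public int parked() { return free.size(); }
}
```

Disposing a whole cache then costs one cheap enqueue per entry (or, with per-cache arenas, a constant-time pointer swap) instead of millions of calls into the native allocator.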

PS I love Google. I do not promote my blog at all, but if you search for "infinispan benchmark" you will find my blog on the first page in the blog category. 
PPS The same is true for a "Terracotta BigMemory benchmark" search. First page in the blog category. 

Tuesday, July 5, 2011

What is Koda (in one sentence)?

A key-code-value in-memory data store. Here is the promised feature set:



1.    Scalable off-heap storage allows using the whole server’s RAM. Tested up to 30G of memory.
2.    Very low and predictable latencies. With cache size = 30G: avg latency = 3.5 microseconds, median = 2.5 microseconds, 99% < 25 microseconds, 99.9% < 50 microseconds, 99.99% < 300 microseconds.
3.    Max latencies are in the 100s of ms even for very large caches.
4.    Very fast (up to 7x faster than BigMemory): 3.5M requests per second (90/10 get/put) with LRU eviction and cache size = 28G.
5.    Very fast (up to 4x faster than on-heap Ehcache).
6.    Hard limit on maximum cache instance size (in bytes) and a hard limit on overall allocated memory.
7.    Low overhead per cache entry (24 bytes).
8.    Compact String representations in memory.
9.    Fast serialization using Kryo.
10.  Custom serialization through Writable and Serializer interfaces.
11. Multiple eviction policies: LRU, FIFO, LFU, RANDOM.
12. Two types of secondary cache indexes (one optimized for scans, one for lookups).
13. Very low index memory overhead (< 10 bytes per cache entry).
14.  Indexes are lightweight (creation performance: 10-20M cache entries per second).
15.  Indexes are off-heap as well, so you do not have to worry about Java OOM.
16. The Execute/ExecuteForUpdate API allows executing Java code inside the server process, enabling more efficient data processing algorithms (no need to lock-load-store-unlock data on the client side).
17. Direct Memory Access (DMA) makes it possible to implement rich off-heap data structures: lists, sets, trees, etc.
18.  Query API.
19.  Queries are performed on the serialized versions of objects (without materializing the objects themselves, using DMA).
20. Queries are very fast (scans of tens of millions of cache entries per second per server without indexes).
21.  Supported OS: Linux, FreeBSD, NetBSD, Mac OS X, Solaris 10/11. 64-bit only.
22.  Infinispan 5.0 integration.
23. Ehcache integration.
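To make item 10 concrete: since values live off-heap as bytes, a pluggable serializer has to write objects straight into a buffer. The interface below is my guess at what such a contract looks like (the names and method signatures are my assumption, not Koda's published API):

```java
import java.nio.ByteBuffer;

// Hypothetical serializer contract: objects are written directly into an
// off-heap buffer so the cache never keeps on-heap copies of the value.
interface Serializer<T> {
    void write(ByteBuffer dst, T value);
    T read(ByteBuffer src);
}

// Example implementation for String values, using a length-prefixed layout.
class StringSerializer implements Serializer<String> {
    public void write(ByteBuffer dst, String value) {
        byte[] bytes;
        try {
            bytes = value.getBytes("UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always present
        }
        dst.putInt(bytes.length);  // 4-byte length prefix
        dst.put(bytes);            // payload
    }
    public String read(ByteBuffer src) {
        byte[] bytes = new byte[src.getInt()];
        src.get(bytes);
        try {
            return new String(bytes, "UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A length-prefixed layout like this is also what makes item 19 possible: a query engine can skip over serialized values without materializing them.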


Sunday, July 3, 2011

Introducing Koda - Key-Code-Value in memory data store

OK, it's hard to begin. I have been working on Koda for the last six or seven months. It is my spare-time project, and now I think it's time to introduce Koda to the general public. It is best to begin with some benchmarks to give you an impression of how fast it is and what you can do with Koda. But before I continue with the benchmark results, I want to talk a little bit about Java garbage collection issues on large and extra-large heaps (> 6-8GB).

If you are a Java programmer like me, you probably know that GC issues (read: long pauses) can ruin any server application that tries to utilize more than 4-8GB of heap memory. Some people think there is some magic combination of HotSpot configuration parameters that can minimize GC pauses and even make them negligible. In some cases (read: for some types of server load) it helps to some extent; in many others it does not. There is no magic bullet yet. Therefore, some companies have decided to go an unusual way: hide allocated memory from the Java GC. Meet Terracotta's BigMemory. The idea is pretty simple: passivate Java objects into off-heap memory, thus hiding them from the garbage collector. This keeps the Java heap (and GC pauses) small even when caching gigabytes of data. I have run some tests with cache sizes up to 30G and can tell you that it works. You will get BigMemory performance numbers later on. But there is one downside to BigMemory: performance (request throughput). It is not on par with a pure Java cache (which keeps all objects in heap memory and does not do any serialization/de-serialization voodoo). Despite this limitation, there are definitely use cases that require deterministic, predictable GC behavior, and BigMemory will find its customers.
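The off-heap trick boils down to one JDK call: ByteBuffer.allocateDirect places the payload in native memory, outside the -Xmx heap, so the collector only ever traces the tiny buffer handle. A minimal sketch (class name is mine, purely for illustration):

```java
import java.nio.ByteBuffer;

// The whole idea in one class: the cached bytes live in native memory
// (ByteBuffer.allocateDirect), so GC traces only the small handle object,
// no matter how many gigabytes sit behind it.
public class OffHeapSlot {
    private final ByteBuffer slot;

    public OffHeapSlot(int capacity) {
        slot = ByteBuffer.allocateDirect(capacity); // native memory, outside -Xmx
    }

    public void put(byte[] value) {
        slot.clear();
        slot.put(value);
        slot.flip(); // slot now holds exactly the stored bytes
    }

    public byte[] get() {
        byte[] copy = new byte[slot.remaining()];
        slot.duplicate().get(copy); // duplicate() so reads are repeatable
        return copy;
    }
}
```

The price is exactly the serialization/de-serialization voodoo mentioned above: every get and put crosses the heap/native boundary by copying bytes.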
The idea behind Koda is the same: keep Java objects in off-heap memory. The project started as an attempt to build something similar to BigMemory, but better (faster). Now I can say that Koda has many more features than BigMemory, and some of them are quite unique (I will describe them in my next posts). OK, let's get down to business: the performance benchmark.

  Benchmark description

This is a simple use scenario: multi-threaded (16 threads) put/get of key-value pairs into a Java cache. The read/write ratio is 90/10. Keys are approximately 20 bytes; value sizes vary between 500 and 1000 bytes. Both keys and values are byte arrays. The cache eviction policy is LRU for all caches. The test measures total request throughput, maximum request latency (maximum GC pause duration), average latency, median latency, and several percentiles: 99%, 99.9%, 99.99%. All measurements are taken after 30 minutes of test execution.
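The measurement loop itself is simple. Here is a single-threaded sketch of it against a plain ConcurrentHashMap stand-in (the real runs used 16 such threads against Ehcache, Infinispan, BigMemory and Koda; the class name and constants here are mine):

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the benchmark loop: 90/10 get/put over random keys, with
// per-request latencies collected for percentile reporting.
public class CacheBench {
    // Returns {99th, 99.9th} percentile latencies in nanoseconds.
    public static long[] run(int ops, double readRatio, long seed) {
        ConcurrentHashMap<Integer, byte[]> cache = new ConcurrentHashMap<Integer, byte[]>();
        Random rnd = new Random(seed);
        long[] latencies = new long[ops];
        byte[] value = new byte[750]; // value size between 500 and 1000 bytes

        for (int i = 0; i < ops; i++) {
            int key = rnd.nextInt(1000000);
            long t0 = System.nanoTime();
            if (rnd.nextDouble() < readRatio) cache.get(key); // 90% reads
            else cache.put(key, value);                       // 10% writes
            latencies[i] = System.nanoTime() - t0;
        }
        Arrays.sort(latencies); // sort once, then index into the tail
        return new long[] {
            latencies[(int) (ops * 0.99)],
            latencies[(int) (ops * 0.999)],
        };
    }
}
```

Sorting the full latency array is fine for a benchmark; a production-grade harness would use a histogram to avoid the O(n log n) post-processing and coordinated-omission pitfalls.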
  Our contestants 
  1.   Ehcache 2.4.2 (enterprise edition)
  2.   Infinispan 5.0 RC6
  3.   Ehcache 2.4.2 + BigMemory
  4.   Koda (sorry, no link yet)
  5.   Infinispan 5.0 RC6 + Koda   
Number 5 is BigMemory for Infinispan, implemented as a custom DataContainer (which is Koda). I replaced the default Infinispan DataContainer implementation with a Koda-based one (which stores cache entries in off-heap memory, similar to BigMemory).


   Hardware/Software

2x Intel Xeon (8 CPU cores), 32G RAM
OS: RHEL 5.5
Java: 1.6.0_23

  Ehcache 2.4.2 (enterprise edition)

HotSpot options: java -server -Xms28G -Xmx28G -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:SurvivorRatio=16 

RPS                             = 1.1M     - requests per second
Max latency                = 65 sec   - longest GC pause duration
Avg latency                 = 0.014 ms (14 microseconds)
Median latency           = 0.002 ms (2 microseconds)
99%   latency              < 0.084 ms
99.9%   latency           < 0.120 ms 
99.99%  latency          < 4.3 ms
Number of cache items: 30M

Ehcache 2.4.2 (enterprise edition) + BigMemory

HotSpot  options: java -server -XX:MaxDirectMemorySize=28G -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:NewSize=64M -XX:SurvivorRatio=16 

RPS                             = 0.5 M     - requests per second
Max latency                = 0.5 sec   - longest GC pause duration
Avg latency                 = 0.030 ms (30 microseconds)
Median latency           = 0.011 ms (11 microseconds)
99%   latency              < 0.064 ms
99.9%   latency           < 4.4 ms 
99.99%  latency          < 13.5 ms


  Infinispan 5.0 RC6

HotSpot options: java -server -Xms28G -Xmx28G -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:SurvivorRatio=16 

Infinispan failed to produce any meaningful results. It ran very well until it reached the maximum number of cache items, then it got stuck for good. I killed the process when the average number of requests per second dropped below 100K.

  Koda 

HotSpot options: java -server 
I allocated 28G of memory for off heap cache.

RPS                             = 3.5 M     - requests per second
Max latency                = 0.11 sec (110 ms)   - longest GC pause duration
Avg latency                 = 0.0045 ms (4.5 microseconds)
Median latency           = 0.0035 ms (3.5 microseconds)
99%   latency              < 0.020 ms
99.9%   latency           < 0.043 ms 
99.99%  latency          < 0.5 ms
Number of cache items: 42M