Hi i came back after long time. This time again with java. Recenly playing with lot of java solutions relating to scaling and clustering. Lot of places java high performacne systems you see buzzwords like multi-threading, clustering, caching, map-reduce, partitioning, grid solutions. Lets discuss some of them.
I have been using hibernate for last 1.5 year in my personal experiments and at work place too. In one of project we had cache requirement for master data which was read-only. There we had implemented caching by hand as follows.
1. Wrapper around DAOs which first check in cache and then go to database.
2. Cache was implemented with Websphere provided object pooling API.
3. Websphere provided object pooling mechanism which even works in clustered set-up or network deployment with cache replication strategies.
Our requirement was well met and caching brought lot of improvement in response time as expected.
After caching, horizontal scaling or clustering comes into my mind when i think about performance.
Since then, I discovered lot of caching/clustering projects/apis, some of which are very much i feel worth to take a note.
1. Tangosol - recently acquired by Oracle, along with oracle, timesten ( oracle in-memory database, another acquisition :-) ) forms formidable data-tier. Tangasol is very proven prodcut which provides lot of features like distributed cache, data partitioning. Cameron Purdy on Theserverside claims to reach upto 0.5 milllion cache transaction per second.
2. Terracotta :- Terracotta is very revolutionary java product in fact listed in top 10 java things of year. Terracotta actually speaking is not caching solution but a clustering solution. Its basically JVM level clustering. They have provided cache solutions for lot of commons requirements : Hibernate, HTTP Session, Spring Beans etc. They provide good extension for lot of open-source projects. I had tested tomcat clustering and terracotta, similar to one on terracotta site and it gives good performance boost. Among all other clustering solutions terracotta has simplest programming model : NO PROGRAMMING MODEL. right : objects basically clustered at jvm level so only configuration change no code change. Practically you do require to change java code but that's minimal All java semantics work well in terracotta cluster. Terracotta is very useful in patterns like Master/Worker or bunch processes looking for some coordination or data sharing. Master Worker pattern term i guess was introduced much before Googles Map/Reduce and very much the same idea.
3. Memcached : this one is in c but has client libraries in almost all major programming languages. Idea of memcached is actually little bit different. You keep bunch of memcached process running objects are stored on these processes with hash key. Memcached works best when its processes run on web servers which are cpu-intensive with lot of memory available.
4. New bunch of grind frameworks : gigaspaces, hadoop - java map-reduce implementation
But now i had come across very different requirement where cache modification (writes) were significant in numbers and jdbc-batching with asynchronous operation was key to performance sometime this is called - write behind. When i looked upon only tangosol claims to have such facility so i decided to implement by hand. Here is an idea.
1. A background threads basically monitors queue.
2. All cache write operations append modified objects to this queue.
3. queue periodically flushes them to database and notifies successful operation to clients.
Here catch is how do you handle object graph updates
1. Delta Calculation
2. New Object insertion may require updates in one specific order where result of first inserts basically required for next one. (foreign key)
Initially i went with hibernate where i used StatelessSession as SQL generation engine hoping that hibernate will calculate delta properly. But lack of L1 cache means no life cycle hence no delta monitoring. So i had two choices.
1. Use merge : Fires extra selects
2. Generate sql by hand.
With help of cglib proxies i managed to get list of dirty columns but then how do you handle object relationships?
I soon realized whole ORM needs to be implemented... here list of requirements for basic ORM cum cache with write-behind
1. easy configuration : declarative orm mapping sufficient for most of scenarios
2. Should provide different execution strategies timer driven, threadpool driven, batch reached, batch and timer combined, entity level strategy for sync
3. in-memory multi-indexing based on columns or own implementation : cache object can be queried with different criteria
4. notification - listeners
5. target rate with % of write operation : 500 requests/second.
5. recovery test : check points : automated and manual
6. clustering with help of terracotta.
This is exhaustive list .. where first one itself is very big one..... my take on it is to restrict the scope and focus on caching instead of fancy orm stuff the one like hibernate.... it should be able to support only object graph ... all lifecycle is left to user..
I will add code snippets in coming posts ... now only idea has been finalized.