Saturday, August 22, 2009

Querying Java Objects stored in Terracotta's NAM Part 3

First and Second part of this series I talked about Querying data structures available in Java. First part specifically talked about existing ones like JoSQL, JxPath and Quaere and discussed indexing problem. Second part specifically talked about lucene and jofti as indexing frameworks and wrote small framework to test and get some performance numbers. This test showed that B-tree index (memory and possibly disk based) is suitable and not lucene index. Third part of this series now I am discussing my own small framework- tcquerymap which is rewrite of Jofti. Jofti is old and not maintained and uses jdk14 libraries. Now even documentation link does not work : http://prism-index.com/guide.html.

querymap documentation link : Getting Started With Querymap
querymap source download : Complete Source Code with Eclipse Project
querymap TIM : Copy this TIM to terracotta modules directory to use it
querymap terracotta sample app : Download sample eclipse App. You can use Terracotta eclipse plugin to launch it within eclipse

So the basic idea is maintaining in-memory indexes which index String Ids with "Comparable" as Keys. All primitive wrappers in Java implement this interface so no special conversion is needed.


How it works

It scans java objects and makes comparable objects for each of the property mentioned to be indexed. It inserts these comparable objects in its own tree against String Ids assigned to the java object. So in-effect its nothing more than maintaining many in-memory Maps. In fact the implementation that I wrote uses JDK TreeMap and not b-tree map. But in-future it will be replaced by more performant B-tree.

Querying
One needs to understand that with in-memory object indexes its only possible to implement subset of SQL query and no join queries. To start with, small framework only implements following operations : arithmetic operands : <,>,<=,>=,==. Logical operands : AND , OR. Range : BETWEEN, Set : IN

To implement querying SQL parser needs to be implemented. I choose to avoid writing parser and implemented direct API kind of querying similar to Quaere. I find Domain specific languages more intuitive since when developer writes code for it he know what querying he writes. So interface looks like below. It is not as good as quaere. Quaere is DSL, below API is just few interfaces implemented. Here execute returns List of IDs put into index against the properties.


import static com.google.code.querymap.ObjectQuery.*;

Collection col = from(Domain.class).where(
gt("inner.property2", 60),
lt("inner.property2",89)
eq("inner.property3",random.nextInt(NUM_OBJECTs))
).execute();



So basically it is equivalent of follwing SQL

select from Domain
where inner.property2 > 60
and inner.property2 < property3=rand(NUM_OBJECTS)


How it performs
Naturally is not as performant as Jofti since it used JDK tree map which uses red-black tree. In future when I complete writing my own b-tree implementation I expect to perform as good as Jofti. Jofti further implemented node level locking so multiple concurrent insert opertations can work parellel. This can also be implemented too. But query performance is not bad and I expect it to improve with b-tree implementation.


Integration with Terracotta

Since it uses TreeMap it is cluster-able easily. Attached tc-config.xml has all correct declarations for it. One more advantage is with Terracotta is that object identifier are readily generated by Terracotta. See implementation TerracottaQueryMap. Please dont compare performance of TerracottaQueryMap against HashMap, CHM or Terracotta Distributed Map(Concurrent String Map old name) since all these are just single index maps so easy to stripe or employ mulitple locks.

For using it as TIM you need to add following lines to tc-config.xml


<modules>
<module id="com.google.code" name="querymap-1.0" version="1.0.0">
</module>


In code you can use it as queryable map as follows. Here propList is list of properties to be indexed.


TerracottaQueryMap map = new TerracottaQueryMap(Domain.class,proplist));

map.put(key,domain1);
map.put(key,domain2);
map.put(key,domain3);


To Query.


Collection col =map.entrySet(
from(Domain.class).
where(
eq("inner.property2",random.nextInt(NUM_OBJECTs))
)
);


Future directions planned
Java has hugh limitation for memory intensive application. So achieve scale two approaches : partition index and merge results or use disk to overflow index pages. Other thing I see can be implemented is that when query selects random elements these random elements need to be faulted from Terracotta server thus degrading performance, same elements can be read from local store too like EHCACHE. Later on this topic separately.

If you think this framework is useful please let me know. You can download its source as Eclipse project(sole dependency on tc.jar) and as Terracotta Integration Module here.

Tushar

Friday, July 10, 2009

Erlang and Concurrency

Here I write a lot about my experiments with Terracotta which is shared durable memory for Java applications which are multi-threaded. But some days before I came across erlang, language which is created at Erricsson for running fault-tolerant telecom applications. Whats so special about it?. If you google or rather "Bing" it you will find plenty of things about it. How cool and scalable it is?. If you read this you will even come to know How Goldman Sachs was using it to gain significant advantage in program trading over other competitors and how others are trying to steal leaked source code.

So erlang is not procedural programming language its functional programming language. Frankly I also need to understand whats so different about it. But I saw this presentation of infoQ site about erlang concurrency and was amazed. Right from starting I always thought about following graph - throughput increases as a function of in-coming request rate till some point but after it stabilizes and then it drops. It drops because of system overload. In a perfectly cpu-intensive lock-contention free application this will happen because of cpu context switching.


But in erlang it stays constant instead your latency (response time) increases. This is according to Little's Law. Little law says relation between throughput and latency is number of users in system. N = RX.




Above is famous graph of benchmark of YAWS (Http server written in erlang) against Apache and you can see how early apache gets saturated and dies. You can read details here but explanation is given here is :

"The problem with Apache is not related to the Apache code per se but is due to the manner in which the underlying operating system (Linux) implements concurrency. We believe that any system implemented using operating system threads and processes would exhibit similar performance. Erlang does not make use of the underlying OS's threads and processes for managing its own process pool and thus does not suffer from these limitations."


So basically all magic is erlang's concurrency model : No Shared State, Only message passing between light-weight processes. Erlang processes are way lighter than Java Threads since they are logical entities and not tied to user-level or kernel-threads. Thus erlang shows "No Shared State" concurrency model scales well. Since now JVM is touted as platform, I am looking forward to see erlang implementation on JVM and see how it does against other concurrent interpreted languages - scala. This is great post why JVM is unfit for such porting. May be Java 8 ( I think closures are not part of Java 7). This is also interesting read about Erlang on Java : Erlang Concurrency model on JVM . Some of work on writing OTP(erlang's sdk for writing applications) for scala http://github.com/jboner/scala-otp/tree/master/.

I have already got Programming Erlang book now looking forward to write first program in OTP.

Tushar

Thursday, July 9, 2009

Links : Java Sample Apps

Many times you hear or read something cool about some framework or tool and you are interested in Sample application written for it just to play with it, browse and go through source-code to find out how to code it.

Here is list of great sample Apps that I just stumbled upon while reading this great blog about Tomcat Clustering.

Link : http://www.mediafire.com/jbs-blog-examples

List is
* IBM-DB2-JDBC-XML.zip
* IBM-DB2-JDBC-Relational.zip
* Google-App-Engine.zip
* WebServices-JAX-WS-Java-SE.zip
* RESTful-WebServices-Apache-CXF.zip
* WebServices-Apache-CXF-Spring-2.5.zip
* WebServices-Apache-CXF.zip
* WebServices-WSIT-Reliable-Messaging.zip
* WebServices-JAX-WS-Web-App-Client-Basic-Security.zip
* WebServices-JAX-WS-Web-App-Client.zip
* WebServices-JAX-WS-Web-App.zip
* PMD-Clover2-Cobertura-Maven2-Test.zip
* WebServices-JAX-WS.zip
* WebServices-Axis2-with-Eclipse-client.zip
* Spring-Maven2-annotations-example.zip
* Spring-Maven2-basic-example.zip

I will add following to above lists which I know about

Terracotta Samples Application written by Team

Monday, June 29, 2009

Terracotta's Hibernate Integration

This post is re-post of my earlier write-behind post but in different perspective : Terracotta's Hibernate integration 3.1

With version 3.1 Terracotta has implemented its own Caching for Hibernate Second Level Caching Provider. Earlier Terracotta's hibernate integration approach was : clustering EHCACHE. Terracotta with its JVM clustering ability, it was easily possible to cluster any POJO structure. So before 3.1, you might have used EHCACHE as hibernate second level cahce provider and tim-hibernate and tim-ehcache for clustering second level cache. With version 3.1 onwards terracotta will have its own cache backed by map-evictor and concurrent string map. Apart from this new hibernate integration has lots of new additions like cache admin console and read-write cache. Cache is always up-to-date and coherent.

But what I feel is that Terracotta platform is way more capable and following additional features can be added to make applications more scalable. These are just cool ideas.

Cache Warm-up feature
It would be nice feature to refresh or load cache whenever application or application cluster is starting up. This can easily be implemented with some sort of CacheLoader interface where Terracotta can callback this interface when faulting cache objects from terracotta server during first access. But such warm-up is only required on full cluster restart otherwise lot of meaningful cache entrites will get overwritten.

Write-Behind Caching
When you think of cache you will arrive at these cache strategies : Read-Through Caching, Write-Through Caching, Write-Behind Caching. Hibernate Second Level cache is Read-Write-Through Cache where if cache miss occurs, entity is read from database and then handed over to cache for susequent access. But H2LC is not Write-Behind caching. With Terracotta's disk persistence and asynchronsous module it would be really efficient for certain use-cases to implement write-behind. Currently hibernate just directly writes to database. Instead if its modified to write to second level cache and persistent async-database-queue, this would decrease latency and increase throughput dramatically. Imagine if you can schedule all your database writes in non-business hours using tim-async. I find write-behind is certainly the best way to reduce pressure on database. And with Terracotta's clusterwide coherent persistent datastore its practicaly possible. Terracotta would be your database guard taking all your querying as well as database inserts on its shoulders.
But this model would require certain changes in the way hibernate works. especially query cache. Since now Terracotta will have latest snapshot of yor System of Record, queries have to be executed against cache and not database. Thus it can not be generic solution. You can implemented write-behind only in certain cases where your business use case permits it. On the other hand to solve query problem Querymap that i disucssed in my previous posts can be used to query certain type of data. So if your business use case permits write-behind and query-map can give you very fast database accelerator. In one of my previous jobs I was working on financial application where certain set of objects were modified at very high rate and same were queried against. For such application classic replicated H2LC does not bring any value, instead it will degrade the performance due to overhead during frequent-cluster-wide updates. But Terracotta will make it scalable, forwarding updates only to Node on which cache entry exists, updating the object clusterwide so when AsyncProcessor picks it up it will contain all the changes made. Its Terracotta's DSO Magic.

Advantage here is that you dont have to do religious shift of Killing Your Database Totally. Database is your System of Record. With Terracotta Hibernate Accerlerator you are only delaying updates to SOR and not replacing it.

Currently I am going through Hibernate source code and learning how hiberante event mechanism works. My guess is that write-behind can be implemented with hibernate events. If not I may try to modify the source code to add write-behind and h2lc-cache querying capability. Hibernate search is similar where instead of classic session you get Indexing-aware session.

With Terracotta FX (assuming your application requires more than 4000 write operations per second - avg throughput of one un-tuned Terracotta server) your write throughput will increase linearly which is not possible with any RDBMS on any type of hardware.

I hope Terracotta will add these features in coming versions. Terracotta 3.1 Hibernate Integration is just start.

Monday, May 25, 2009

Querying Java Objects stored in Terracotta's NAM Part 2


First part of this series, I talked about existing frameworks and what I found out is that they lack indexing hence rarely useful for large data-sets. So I tried finding out how to do indexing. My idea was simple : index objects and store reference to object in index then with Terracotta you can cluster objects and index as well. So it becomes "queryable" datastore. My first attempt was to find out how indexing is done. By book it says b-tree index. I found out this(jdbm) framework which is trying to do the persistent DB in Java by implementing B-tree indexes on disk.I took only b tree and implemented simple query parser. What it does is that it traverses b-tree and finds out tuples and then returns them.



After this first attempt, then I experimented with Lucene. Lucene is not tree-index, its inverted index but it has lot of capability and its fast, supports in-memory and disk-based indexes.



So here is my little framework for queryable datastore :



public interface TCQueryMap extends Map{
void init();
Map query(String query);
}




Naturally its extension of Map which is single index. My implementation wraps a HashMap with ReadWrite locks and Lucene RAMDirectory index. So all get/puts hit index within lock boundaries and then you can query index. This is very simple, I have not gone into complexities like spill-over of index onto disk etc.


LuceneIndexingConfig config = new LuceneIndexingConfig();
List propList = new ArrayList();
// index three properties only
propList.add("accountName");
propList.add("person.age");
propList.add("person.name.firstName");

config.setIndexPropertyList(propList);

TCQueryMap indexer = new LuceneQueryStore(config);

// add object
Acccount account .....
indexer.put("user99",account);

// query object
Map col = indexer.query("person.age:21");




I also came to know about Jofti from one of comments. Jotfi is what I would eventually like to write. I don't know why its not used by many people. One reason could be its not maintained. I found it pretty useful so I plugged in Jofti as well in my framework.


Now lets compare it with simple Hibernate-JDBC based solution. Obviously its not perfect. SQL is way more complex and expressive language. But here we are talking about cached data and I am sure once data is cached ( it means objectified from join query on relational DB) very few times you will require join, its mostly "where clause" of one or more conditions. Lucene does that very well.


So lets see numbers. I have not done any tuning apart from standard lucene stuff. One of main parameters is how many properties you want to index. This determines index size, memory and speed.


Below is small benchmark showing 60K objects inserts with three properties indexed and then random queries on three properties

Lucene Inserts/Sec = 1000
Jofti Inserts/Sec = 8793.78
Lucene Queries/Sec = 5172
Jofti Queries/Sec =13636.36



Results for 14 properties indexed :

Lucene Inserts/Sec = 740
Jofti Inserts/Sec = 4866.96
Lucene Queries/Sec = 3750
Jofti Queries/Sec = 12500



Since Jofti is Tree index it outperforms Lucene Index. The problem with Lucene is that once index gets bigger insert performance slows down. Also these numbers are taken with one commit on one put operation. If you index lot of objects together and then commit, lucene is also fast, that's how it is to be used - Batch API. On the other hand Jotfi is fast, I could not find any details about being thread-safe and other concurrency issues so I wrapped it around Lock. I don't know why Jofti is not used by many people.


Also what if you can run Hibernate/JPA queries on Map? that would be great. Its already done by hibernate team. They run query against Second level cache but it would big task to find out and extract idea out of it. Just a thought. Second Level Query cache gets invalidated when you modify single entity, imagine if we update the same object in QueryMap cache you dont need Query cache of course querying capability is not great.


Another thought that comes in my mind is clustering in-memory databases like H2 or HSQLDB. Imagine the benefits of it. But then its anti-terracotta. Why? it would be Relational DB with baggage of ORM mismatch.


Entire source code you can download it from here : http://code.google.com/tc-querymap/. Tar file is just bunch of java files and its very early
prototype. Stay tuned to project, I will update it once I finish with proper integration with Terracotta.



So if you find it useful please leave comments, I would love to hear from you.

Monday, May 4, 2009

Got one!!

Finally I got my own I Love Linux T-shirt, here is snap.
You too can create your own T-shirt with Geeky quotes or any other quotes or picture. I am planning to make another with Ubuntu but waiting/searching for good image apart from standard "Linux for Human Beings"



Wednesday, April 15, 2009

Portable Ubuntu Rocks


Years ago (literally 2.5 years ago) I had tried co-linux. At that time it was in initial stages but was working perfectly in text mode. If you don't know whats co-linux, its linux distribution which works like Windows Binary. No need to setup Virtaul Machine Emualator and install or add virtual images. This was when I had never heard of Virtualization and I was really amazed of the idea. At that time co-linux had managed some elementary GUI drawing mainly KDE applications (at least i had seen screeshots), it did not work though on my machine. Just minutes before I downloaded portable ubuntu after reading this Post from LifeHacker. And it works!! just like described. Who needs VMWare and stuff like that if you can work on linux shell as well as Firefox just Alt-Tab apart. Here is screenshot of Portable Ubuntu in action.