Tushar Khairnar's Tech Blog: May 2009

In the first part of this series, I talked about existing frameworks and found that they lack indexing, which makes them rarely useful for large data sets. So I tried to find out how to do indexing. My idea was simple: index the objects and store a reference to each object in the index; then with Terracotta you can cluster the objects and the index as well, so it becomes a "queryable" datastore. My first attempt was to find out how indexing is done. By the book, it is a B-tree index. I found the jdbm framework, which tries to build a persistent DB in Java by implementing B-tree indexes on disk. I took only the B-tree and implemented a simple query parser on top of it. It traverses the B-tree, finds the matching tuples and returns them.
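To picture the "store a reference in the index, then walk the tree" idea, here is a minimal sketch. It uses java.util.TreeMap (a red-black tree, not a B-tree) purely as a stand-in for an ordered index. It is not jdbm's API, and the class name SortedAgeIndex and the integer "age" key are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical ordered index: maps an age to references of the objects having that age. */
class SortedAgeIndex<T> {
    // TreeMap keeps keys sorted, so a range lookup only visits the matching part of the tree
    private final NavigableMap<Integer, List<T>> byAge = new TreeMap<>();

    /** Stores a reference to the object under its age; the object itself lives elsewhere. */
    void add(int age, T object) {
        byAge.computeIfAbsent(age, k -> new ArrayList<>()).add(object);
    }

    /** Returns all objects whose age lies in [low, high], in ascending age order. */
    List<T> between(int low, int high) {
        List<T> hits = new ArrayList<>();
        if (low > high) {
            return hits; // subMap would throw on an inverted range
        }
        for (List<T> bucket : byAge.subMap(low, true, high, true).values()) {
            hits.addAll(bucket);
        }
        return hits;
    }
}
```

An equality query such as age = 21 is just the special case between(21, 21).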

After this first attempt, I experimented with Lucene. Lucene is not a tree index; it is an inverted index, but it has a lot of capabilities, it is fast, and it supports both in-memory and disk-based indexes.

So here is my little framework for a queryable datastore:



 import java.util.Map;

 public interface TCQueryMap extends Map {
  void init();
  // run a query string against the index and return the matching entries
  Map query(String query);
 }

Naturally, it is an extension of Map, which is itself a single index (on the key). My implementation wraps a HashMap with read-write locks and a Lucene RAMDirectory index. All gets and puts hit the index within lock boundaries, and then you can query the index. This is very simple; I have not gone into complexities like spilling the index over onto disk.
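The "HashMap plus index, both updated inside one lock" pattern can be sketched in plain Java as below. This is not the actual LuceneQueryStore code: to stay self-contained it swaps Lucene for a toy "field:value" lookup table, models each stored object as a flat map of field values, and the class name LockedIndexedMap is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: every put updates the data map and a toy index under one write lock. */
class LockedIndexedMap {
    private final Map<String, Map<String, String>> data = new HashMap<>();
    // index key is "field:value"; the value is the set of map keys whose object matches
    private final Map<String, Set<String>> index = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Stores the object and indexes all of its fields atomically. */
    public void put(String key, Map<String, String> fields) {
        lock.writeLock().lock();
        try {
            unindex(key, data.get(key)); // drop stale index entries on overwrite
            data.put(key, new HashMap<>(fields));
            for (Map.Entry<String, String> f : fields.entrySet()) {
                index.computeIfAbsent(f.getKey() + ":" + f.getValue(), k -> new HashSet<>()).add(key);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Answers a single-term query such as "person.age:21"; readers can run concurrently. */
    public Map<String, Map<String, String>> query(String term) {
        lock.readLock().lock();
        try {
            Map<String, Map<String, String>> result = new HashMap<>();
            for (String key : index.getOrDefault(term, Collections.emptySet())) {
                result.put(key, new HashMap<>(data.get(key)));
            }
            return result;
        } finally {
            lock.readLock().unlock();
        }
    }

    // caller must hold the write lock
    private void unindex(String key, Map<String, String> old) {
        if (old == null) {
            return;
        }
        for (Map.Entry<String, String> f : old.entrySet()) {
            Set<String> keys = index.get(f.getKey() + ":" + f.getValue());
            if (keys != null) {
                keys.remove(key);
            }
        }
    }
}
```

Because the data map and the index change under the same write lock, a query never sees an object that is in one but not the other. Configuring and using the real store looks like this: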


 LuceneIndexingConfig config = new LuceneIndexingConfig();
 List<String> propList = new ArrayList<String>();
 // index three properties only
 propList.add("accountName");
 propList.add("person.age");
 propList.add("person.name.firstName");

 config.setIndexPropertyList(propList);

 TCQueryMap indexer = new LuceneQueryStore(config);

 // add object
 Account account = ...; // construction elided
 indexer.put("user99", account);

 // query object
 Map col = indexer.query("person.age:21");
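The dotted names in the property list ("person.name.firstName") imply walking the object graph to pull out the value to index. Below is one way that could work, using JavaBean getters and reflection. It is a sketch under assumptions: PropertyPaths is a hypothetical helper, not part of the framework, and the Account, Person and Name classes are guessed from the property names above.

```java
import java.lang.reflect.Method;

/** Hypothetical helper: resolves a dotted path such as "person.name.firstName" via getters. */
class PropertyPaths {
    static Object resolve(Object root, String path) {
        Object current = root;
        for (String part : path.split("\\.")) {
            if (current == null) {
                return null; // a null link in the chain means there is nothing to index
            }
            String getter = "get" + Character.toUpperCase(part.charAt(0)) + part.substring(1);
            try {
                Method m = current.getClass().getMethod(getter);
                current = m.invoke(current);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot resolve '" + part + "' in " + path, e);
            }
        }
        return current;
    }
}

// Assumed sample model, inferred only from the indexed property names
class Name {
    private final String firstName;
    Name(String firstName) { this.firstName = firstName; }
    public String getFirstName() { return firstName; }
}

class Person {
    private final int age;
    private final Name name;
    Person(int age, Name name) { this.age = age; this.name = name; }
    public int getAge() { return age; }
    public Name getName() { return name; }
}

class Account {
    private final String accountName;
    private final Person person;
    Account(String accountName, Person person) { this.accountName = accountName; this.person = person; }
    public String getAccountName() { return accountName; }
    public Person getPerson() { return person; }
}
```

For example, PropertyPaths.resolve(account, "person.age") returns the boxed age, which could then be written into the index as the value of the "person.age" field.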

I also came to know about Jofti from one of the comments. Jofti is what I would eventually like to write. I don't know why it is not used by many people; one reason could be that it is not maintained. I found it pretty useful, so I plugged Jofti into my framework as well.

Now let's compare it with a simple Hibernate/JDBC-based solution. Obviously it's not perfect: SQL is a far more complex and expressive language. But here we are talking about cached data, and I am sure that once data is cached (meaning it has been objectified from a join query on a relational DB) you will very rarely need a join; it is mostly a "where clause" of one or more conditions. Lucene does that very well.

So let's see the numbers. I have not done any tuning apart from the standard Lucene stuff. One of the main parameters is how many properties you want to index; this determines index size, memory use and speed.

Below is a small benchmark: 60K object inserts with three properties indexed, followed by random queries on those three properties.

Lucene Inserts/Sec = 1000
Jofti Inserts/Sec = 8793.78
Lucene Queries/Sec = 5172
Jofti Queries/Sec = 13636.36

Results for 14 properties indexed :

Lucene Inserts/Sec = 740
Jofti Inserts/Sec = 4866.96
Lucene Queries/Sec = 3750
Jofti Queries/Sec = 12500

Since Jofti is a tree index, it outperforms the Lucene index. The problem with Lucene is that insert performance slows down as the index gets bigger. Also, these numbers were taken with one commit per put operation. If you index a lot of objects together and then commit, Lucene is also fast; that is how it is meant to be used, as a batch API. Jofti, on the other hand, is fast, but I could not find any details about its thread safety or other concurrency issues, so I wrapped it in a lock. I don't know why Jofti is not used by many people.
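The difference between "one commit per put" and "commit after a batch" can be sketched without Lucene. In the code below, CountingIndex is a made-up stand-in for an index writer whose commit() is the expensive step, and BatchingWriter is a hypothetical wrapper that commits once every batchSize adds. Neither is a Lucene class.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index writer; commit() represents the costly operation. */
class CountingIndex {
    final List<String> pending = new ArrayList<>();
    final List<String> committed = new ArrayList<>();
    int commits = 0;

    void add(String doc) {
        pending.add(doc);
    }

    /** Makes all pending documents visible and counts how often this happens. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
        commits++;
    }
}

/** Commits once per batchSize adds instead of once per add. */
class BatchingWriter {
    private final CountingIndex index;
    private final int batchSize;
    private int sinceCommit = 0;

    BatchingWriter(CountingIndex index, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.index = index;
        this.batchSize = batchSize;
    }

    void add(String doc) {
        index.add(doc);
        if (++sinceCommit >= batchSize) {
            flush();
        }
    }

    /** Commits whatever is still pending; call this before querying. */
    void flush() {
        if (sinceCommit > 0) {
            index.commit();
            sinceCommit = 0;
        }
    }
}
```

With a batch size of 1 this behaves like the benchmark above (one commit per put). Larger batches trade fewer commits for a window in which recently added objects are not yet visible to queries, which matters if the map is expected to be queryable immediately after a put.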

Also, what if you could run Hibernate/JPA queries on a Map? That would be great. It has already been done by the Hibernate team: they run queries against the second-level cache, but it would be a big task to find that code and extract the idea from it. Just a thought. The second-level query cache gets invalidated when you modify a single entity; if instead we update the same object in the QueryMap cache, you don't need a query cache at all. Of course, the querying capability is not as great.

Another thought that comes to mind is clustering in-memory databases like H2 or HSQLDB. Imagine the benefits. But then it's anti-Terracotta. Why? It would be a relational DB with the baggage of the object-relational mismatch.

You can download the entire source code from here: http://code.google.com/tc-querymap/. The tar file is just a bunch of Java files and it is a very early prototype. Stay tuned to the project; I will update it once I finish proper integration with Terracotta.

So if you find it useful, please leave a comment; I would love to hear from you.


Monday, May 25, 2009

Querying Java Objects stored in Terracotta's NAM Part 2
