Saturday, August 22, 2009

Querying Java Objects stored in Terracotta's NAM Part 3

First and Second part of this series I talked about Querying data structures available in Java. First part specifically talked about existing ones like JoSQL, JxPath and Quaere and discussed indexing problem. Second part specifically talked about lucene and jofti as indexing frameworks and wrote small framework to test and get some performance numbers. This test showed that B-tree index (memory and possibly disk based) is suitable and not lucene index. Third part of this series now I am discussing my own small framework- tcquerymap which is rewrite of Jofti. Jofti is old and not maintained and uses jdk14 libraries. Now even documentation link does not work :

querymap documentation link : Getting Started With Querymap
querymap source download : Complete Source Code with Eclipse Project
querymap TIM : Copy this TIM to terracotta modules directory to use it
querymap terracotta sample app : Download sample eclipse App. You can use Terracotta eclipse plugin to launch it within eclipse

So the basic idea is maintaining in-memory indexes which index String Ids with "Comparable" as Keys. All primitive wrappers in Java implement this interface so no special conversion is needed.

How it works

It scans java objects and makes comparable objects for each of the property mentioned to be indexed. It inserts these comparable objects in its own tree against String Ids assigned to the java object. So in-effect its nothing more than maintaining many in-memory Maps. In fact the implementation that I wrote uses JDK TreeMap and not b-tree map. But in-future it will be replaced by more performant B-tree.

One needs to understand that with in-memory object indexes its only possible to implement subset of SQL query and no join queries. To start with, small framework only implements following operations : arithmetic operands : <,>,<=,>=,==. Logical operands : AND , OR. Range : BETWEEN, Set : IN

To implement querying SQL parser needs to be implemented. I choose to avoid writing parser and implemented direct API kind of querying similar to Quaere. I find Domain specific languages more intuitive since when developer writes code for it he know what querying he writes. So interface looks like below. It is not as good as quaere. Quaere is DSL, below API is just few interfaces implemented. Here execute returns List of IDs put into index against the properties.

import static*;

Collection col = from(Domain.class).where(
gt("inner.property2", 60),

So basically it is equivalent of follwing SQL

select from Domain
where inner.property2 > 60
and inner.property2 < property3=rand(NUM_OBJECTS)

How it performs
Naturally is not as performant as Jofti since it used JDK tree map which uses red-black tree. In future when I complete writing my own b-tree implementation I expect to perform as good as Jofti. Jofti further implemented node level locking so multiple concurrent insert opertations can work parellel. This can also be implemented too. But query performance is not bad and I expect it to improve with b-tree implementation.

Integration with Terracotta

Since it uses TreeMap it is cluster-able easily. Attached tc-config.xml has all correct declarations for it. One more advantage is with Terracotta is that object identifier are readily generated by Terracotta. See implementation TerracottaQueryMap. Please dont compare performance of TerracottaQueryMap against HashMap, CHM or Terracotta Distributed Map(Concurrent String Map old name) since all these are just single index maps so easy to stripe or employ mulitple locks.

For using it as TIM you need to add following lines to tc-config.xml

<module id="" name="querymap-1.0" version="1.0.0">

In code you can use it as queryable map as follows. Here propList is list of properties to be indexed.

TerracottaQueryMap map = new TerracottaQueryMap(Domain.class,proplist));


To Query.

Collection col =map.entrySet(

Future directions planned
Java has hugh limitation for memory intensive application. So achieve scale two approaches : partition index and merge results or use disk to overflow index pages. Other thing I see can be implemented is that when query selects random elements these random elements need to be faulted from Terracotta server thus degrading performance, same elements can be read from local store too like EHCACHE. Later on this topic separately.

If you think this framework is useful please let me know. You can download its source as Eclipse project(sole dependency on tc.jar) and as Terracotta Integration Module here.