In my previous post (YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World ) I introduced the common query semantics in the post Only-SQL world and made an argument that the right approach is to decouple the query semantics from the underlying NoSQL implementation. This would allow us to combine SQL semantics with NoSQL backend to achieve the best of both worlds -- standard query and scalability.
In this post I wanted to illustrate this idea through some code example using GigaSpaces as the underlying implementation of the concept. The example is based on the the same code examples from another earlier post, WTF is Elastic Data Grid? (By Example) where I covered some of the simple models for writing and reading objects to a distributed data grid. Toward the end I will reference other examples such as Datanucleus and Hbase, as well as cover patterns for supporting NoSQL semantics such as document model with existing RDBMS such as MySQL.
SQL/NoSQL Code Example (Using GigaSpaces)
The Data Model
@SpaceClass
public class Data implements Serializable {
private Long id;
private String data;
private Map
public Data(long id, String data , Map
this.id = id;
this.data = data;
this.info = info;
}
@SpaceRouting
@SpaceId
public Long getId() {
return id;
}
// getter/setter for id, data, info attributes omitted
}
As with the previous example, we use @SpaceRouting annotation to set the routing index. This index determines which target partition this instance should belong to. @SpaceId determines the unique identifier (Key) for this instance.
Adding Document Model Semantics
Most of the common examples for document-based APIs use a JSON data model as the document. A document in JSON world would look something like this:
Source: Wikipedia
From a NoSQL implementation perspective, a document often translates to a Map of Maps where each attribute is mapped to a key,value representation and nested value to a key whose value is a Map and so forth.
Document Model in GigaSpaces
In the current version of GigaSpaces a document is basically a Map attribute whose values are indexed. That gives the flexibility to add,remove attribute on an existing object without changing the object schema as with any schemaless API. A nested object would be mapped to a key whose value is a complex object that is stored just like any regular POJO.
In this example I added an info attribute that is basically a Map of Key/Values.
private Map
One of the main benefit of this approach is the combination of a strongly typed POJO/Table model with the flexibility of the document model. In other words, I could match objects by type, and each object can have a “fixed” structure that must conform to certain type, and a variable part that can vary on a per instance basis. In our specific example, when I look for a Data object I'm guaranteed to get the Data.id and Data.data, but Data.info could include a variable list of attributes that can vary on a per instance basis. I can still “pin” some attributes by forcing an indexing on the relevant keys as illustrated below:
// this defines several indexes on the same info property
@SpaceIndexes( { @SpaceIndex(path = "info.address", type = SpaceIndexType.BASIC),
@SpaceIndex(path = "info.socialSecurity", type = SpaceIndexType.BASIC)
})
public Map
return info;
}
Note: As of GigaSpaces 8.0 release the document model semantics would be extended to support full hierarchy of Maps of Maps and dynamic indexes.
Combining SQL Query with (GigaSpaces) NoSQL Data Store
Now that we’ve gone through the API example let’s see how we could insert this data into a NoSQL data store (GigaSpaces in this specific case) and query it through SQL.
Inserting the Data Object
for (long i=0;i<1000;i++)
{
gigaSpace.write(new Data(i,"message" + i, createInfo(i)));
}
The createInfo() generates a new Map and fill it with values as can be seen below:
public static Map
{
Map< String, Object> info = new HashMap
info.put("address", i + " Broadway");
info.put("socialSecurity", 1232287642L + i);
info.put("salary", 10000 + i);
return info;
}
Querying the NoSQL document data using SQL
There are basically two models for querying data using SQL in GigaSpaces. (1) Adding SQL-like queries using the GigaSpaces SQLQuery API (2) using a fully standard SQL jdbc driver.
Querying the data using a SQL-like model
Data[] d = gigaSpace.readMultiple(
new SQLQuery(Data.class, "info.salary < 11000 and info.salary >= 10000"),
Integer.MAX_VALUE);
The gigaspaces.readMultiple(..) operation is equivalent to a “select” statement. It takes the class (the equivalent of the from table name clause in SQL), and query clause “info.salary < 11000 and info.salary >= 10000” (the equivalent of the where clause in SQL) .
As we can see the syntax of the SQL query borrows the syntax of an object-oriented model where I can reference to the associated attributes within the document attribute just like any nested attribute. In our example info.salary would point to the info Map and pull the attribute who’s key=”salary”.
This API is useful if you're already working in POJO as your domain model, as it works with objects natively and therefore can bypass the need for any O/R Mapping.
Querying the data using standard JDBC
In this example we will illustrate how we can plug different query engines to the same data instance. In this case we will use a standard JDBC connection to connect to the data in the following way:
Connection conn;
Class.forName("com.j_spaces.jdbc.driver.GDriver").newInstance();
String url = "jdbc:gigaspaces:url:jini://*/*/myElasticDataGrid";
conn = DriverManager.getConnection(url);
Statement st = conn.createStatement();
String query = "SELECT * FROM com.gigaspaces.examples.Data";
ResultSet rs = st.executeQuery(query);
// Iterate through the result set
int i = 0;
while (rs.next()) {
System.out.println("Data [" + (i++) +"] "+ rs.getString("data") );
}
The first line sets up the GigaSpaces JDBC driver. The GigaSpaces JDBC driver is responsible for mapping the SQL query language into a the underlying GigaSpaces methods. That means that from a GigaSpaces data store perspective JDBC calls looks pretty much the same as any other call. The URL uses the following pattern <gigaspaces jdbc prefix>:<gigaspaces space url>, where the gigaspaces-jdbc-prefix is always set to jdbc:gigaspaces. The space URL points to the relevant data grid cluster.
Note that one of the interesting concepts that comes with this is that you don’t need to point to a specific host as you would in most of the today's databases, but rather we use “*” as the host name, which initiates a network discovery using multicast to find the relevant instances of the cluster.
The rest of the code looks just like any other SQL call. We use fairly basic mapping where every class is mapped to a table and and an object attribute is mapped to a column. Note that unlike complex O/R mapping we leverage the fact that a space can store objects in their native format. That means that we don’t need to break nested objects into different tables but instead we store them as single embedded java object where the relationships are kept consistent with their original java representation.
Since the standard SQL doesn’t support nested object queries our current version of the JDBC provides access to the top level attributes. Any nested object is treated as a POJO and is matched according to the POJO based template matching semantics. Currently, this limits the type of queries that one can perform on nested object compared with the SQL-like API because we can’t perform matching on individual fields within the document or any nested object. As the SQLQuery and JDBC layer use the same underlying query engine, this limitation is only a semantic limitation and not a technical limitation, and will be resolved in the next release of GigaSpaces.
What About Performance?
Alex Popescu, co-founder and CTO of InfoQ.com, made a comment in his post NoSQL Databases Should Support SQL Queries questioning the performance overhead associated with adding another layer of indirection, as I pointed out in my original post. He rightfully quoted Jeff Kesselman:
The two software problems that can never be solved by adding another layer of indirection are that of providing adequate performance or minimal resource usage.
Indeed, this is one of the main challenge in this entire discussion. It is relatively simple to add another level of indirection but it's almost impossible to make it perform well.
The key to addressing the performance challenge relies on the implementation of the underlying data store and how well it is suited to support the functionality required by the higher level abstraction. In our specific NoSQL discussion the performance of the SQL abstraction would be greatly influenced by the ability of the underlying NoSQL data store to support complex queries at the core level. In this case, adding a different set of query semantics would be a matter of simple syntax mapping which should yield negligible overhead and in some cases could turn out to be more efficient, as we could support algorithms that are not simple to implement at a lower level. A good example is found in Kevin Weil's presentation Hadoop, Pig, and Twitter (slides 17-18). Kevin provides an example of how a simple Pig query could map to a fairly complex hadoop task in Java. So even if a lower level API could be more efficient at a micro level, it might turn out to be less efficient at the macro level as a result of the associated complexity.
In our example the abstraction provided by putting Datanucleus on top of Google BigTable is probably going to yield fairly high overhead when it comes to complex queries, because most of the query semantics are implemented at the mapping layer. With GigaSpaces, we chose to support all the query semantics at the core layer and make the SQL abstraction a thin semantics mapping layer. A recent benchmark test showed only 2% difference between JDBC queries and native queries.
GigaSpaces Native Query engine makes the overhead
of the SQL abstraction negligible
Other SQL/NoSQL References
Datanucleus provides a JPA/JDO mapping layer ontop of various datastore implementations.
As such, it well positioned to provide a common SQL abstraction layer. It is also well advanced in the area of becoming a common mapping layer for multiple NoSQL datastores. Currently it supports HBase, BigTable, Amazon S3. MongoDB and Cassandra are currently works in progress.
As there is still a big difference between the various datasource implementations (each supports only a subset of query semantics), each datasource plugin covers a subset of Datanucleus features. There are various documented limitations per datastore. You can find the list of supported features in the relevant datastore plugin documentation (a reference to googlestorage is provided here ).
You can also find a useful code example for using the Datanuclues JPA plugin for Hadoop/Hbase in the following post Apache Hadoop HBase plays nice with JPA by Matthias Wessendorf.
RDBMS Support for NoSQL Semantics
As I was working on this write-up, which centers around supporting SQL on top of a NoSQL datastore, I came across an interesting post: Schema-Free MySQL vs NoSQL by Ilya Grigorik, CTO/Founder at PostRank.
In his post, Ilya provides an interesting pattern that shows how you can store a schemaless document model on a MySQL datastore.
Instead of defining columns on a table, each attribute has its own table (new tables are created on the fly), which means that we can add and remove attributes at will. In turn, performing a select simply means joining all of the tables on that individual key
Ilya ends with a comment that I thought was in line with the my thoughts outlined in this post:
..there is absolutely no reason why we can’t have many of the benefits of “NoSQL” within MySQL itself.
Final Words
In this post I've tried to illustrate, mainly through the GigaSpaces example, how the SQL and NoSQL models can be brought together to achieve the best of both worlds, i.e., the scalability of the NoSQL model and the rich and standard query semantics of SQL. At the same time, I believe that additional semantics that are not currently supported by SQL such as the document model and object relationship can also be brought together as an extension to current SQL semantics as illustrated in this post.
Having said that, there will be a time where a Object or document-centric language are more intuitive or simpler to use then SQL. The decoupling of the datastore and the query language should enable us to write objects in an Object/Document-centric model and still query it using a SQL engine and vice versa. This will give us more flexibility to choose the language that best fits the context in which it is being used, and mix and match among them just as we do today in any web application where we mix together different languages such as HTML, CSS, JavaScript and other languages.
Finally, I believe that large part of the discussion in the NoSQL community took a wrong turn by putting too much emphasize on the query language rather then the scalability patterns which IMO should remain the only motivation for switching between one datastore to the other.
References
- YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World
- WTF is Elastic Data Grid? (By Example)
- Apache Hadoop HBase plays nice with JPA by Matthias Wessendorf
- Hadoop, Pig, and Twitter presentation by Kevin Weil
- Schema-Free MySQL vs NoSQL by Ilya Grigorik
- NoSQL databases Should Support SQL Queries by Alex Popescu