It's time to think of the architecture and application platforms surrounding "Big Data" databases. Big Data is often centered around new database technologies mostly from the emerging NoSQL world. The main challenge that these databases solve is how to handle massive amount of data at a reasonable cost and without poor performanc - distributed databases emerged to address this challenge and today we're seeing high adoption rate and quite impressive success stories such as the Netflix use of Cassandra/DataStax solution. All that indicate the speed in which this market evolves.
The need for a Big Data Application Platform
Application platforms provide a framework for making the development of applications simpler. They do this by carving out the generic parts of applications such as security, scalability, and reliability (which are attributes of a 'good' application) from the parts of the applications that are specific to our business domain.
Most of the existing application platforms such as Java EE and Ruby on Rails were designed to work with centralized relational databases in mind. Clearly, that model doesn’t fit well to the Big Data world simply because it wasn’t designed to deal with massive amount of data in first place. In addition to that, frameworks such as Hadoop are considered too complex as noted in VP/Research Director for Forrester Research Mike Gilpin, in his post, "Big Data" technology: getting hotter, but still too hard:
"Big Data" also matters to application developers - at least, to those who are building applications in domains where "Big Data" is relevant. These include smart grid, marketing automation, clinical care, fraud detection and avoidance, criminal justice systems, cyber-security, and intelligence.
One "big question" about "Big Data": What’s the right development model? Virtually everyone who comments on this issue points out that today’s models, such as those used with Hadoop, are too complex for most developers. It requires a special class of developers to understand how to break their problem down into the components necessary for treatment by a distributed architecture like Hadoop. For this model to take off, we need simpler models that are more accessible to a wider range of developers - while retaining all the power of these special platforms.
...to take maximum advantage of big data, IT is going to have to press the re-start button on its architecture for acquiring and understanding information. IT will need to construct a new way of capturing, organizing and analyzing data, because big data stands no chance of being useful if people attempt to process it using the traditional mechanisms of business intelligence, such as a data warehouses and traditional data-analysis techniques.
To effectively write Big Data applications, we need an Application Platform that would put together the different patterns and tools that are used by pioneers in that space such as Google, Yahoo, and Facebook in one framework and make them simple enough so that any organization could make use of them without the need to go through huge investment.
Here's my personal view on how that platform could look like based on my experience covering the NoSQL space for a while now and through my experience with GigaSpaces.
Characteristics of Big Data Application Platform
As with any Application Platform, a Big Data application platform needs to support all the functionality expected from any application platform such as scalability, availability, security, etc.
Big Data Application platforms are unique in the sense that they need to be able handle massive amounts of data and therefore need to come with built-in support for things like Map/Reduce, Integration with external NoSQL databases, parallel processing, and data distribution services and on top of that, they should make the use of those new patterns simple from a development perspective.
Below is a more concrete list of the specific characteristics and features that define what Big Data Application Platform ought to be. I've tried to point to the specific Java EE equivalent API and how it would need be extended to support Big Data application.
Support Batch and Real Time analytics
Most of the existing application platforms were designed for handling of transactional web applications and have little support for business analytics applications. Hadoop has become the de facto standard for handling Batch processing; Real Time analytics, however, is done through other means outside of the hadoop framework, mostly through an event processing framework, as I already outlined in details during my previous post Real Time Analytics for Big Data: An Alternative Approach.
A Big Data Application Platform would need to make Big Data application development closer to mainstream development by providing a built-in stack that includes integration with Big Data databases from the NoSQL world, and Map/Reduce frameworks such as Hadoop and distributed processing, etc.
It also needs to extend the existing Transaction processing and Event Processing semantics that come with JavaEE for handling of Real Time analytics that fits into the Big Data world as outlined in the references below:
Making Big Data Application Closer to Mainstream development practices
Domain model and Data access API
Writing Big Data application is quite different than writing a typical CRUD application to a centralized relational database in many forms. The main difference is with the design of our data domain model, and the API and Query semantics that well use to access and process that data.
Mapping has proven to be a fairly effective approach to map the impedance mismatch between different data models and API’s. A good reference on that regard is the use of O/R mapping tools such as Hibernate for bridging similar impedance mismatch between Object and Relational models.
Abstraction frameworks such as Spring has also proven to be quite useful means to plug-in different data sources that doesn’t comply to common API through plug-in approach.
To make the development of Big Data Application closer to mainstream development Big Data application platform need to come with similar mapping and abstraction tools that would make the transition from the existing API’s significantly smoothers.
Today we already have various mapping and abstraction framework available.
For Big Data Mapping tools:
For batch processing were already seeing the increase adoption of frameworks such as Hive which provide an SQL like façade for handling complex batch processing with hadoop.
JPA provides more standard JEE abstraction that fits into Real Time Big Data application. Google App Engine uses Data Nucleus on top of Big Table, With GigaSpaces we came up with a high performance JPA abstraction for our In Memory Data Grid using Open JPA, and Redhat came up with Hibernate OGM (Object-Grid Mapping).
In addition to these, we're seeing that the existing mapping tools add extensions to support data partitioning semantics through annotations as in the case of EclipseLink and others.
For Big Data Spring Abstraction:
SpringSource came out with an interesting and even more high level abstraction known as Spring Data that makes it possible to map different data stores of all kinds into one common abstraction through annotation and plug-in approach. The abstraction is leaky currently, but shows progress in the space.
In Java EE, business logic refers to the parts in our application that are responsible for processing the data. As with JavaEE the data often lives in a centralized database SessionBean was often mapped to a single instance per user session.
With BigData the common pattern for processing data is through Map/Reduce. Map/Reduce was designed to handle the processing of massive amount of data through moving the processing logic to the data and distributing the logic in parallel to all nodes. Having said that, developing parallel processing code is considered fairly complex.
A Big Data Application Platform would need to make Map/Reduce and parallel execution simple. One way of doing that is by mapping this semantics into existing programming models. One example would be to extend the current SessionBean model to support this sort of semantics (as with the GigaSpaces Remoting semantics) – this makes parallel processing look like a standard remote method invocation. Another way is to provide more native parallel execution semantics by extending existing semantics such as createNativeQuery in JPA, or the executor API in Java.
Interestingly enough, at the time of writing this post I came across Robin van Breukelen's post, "Distributed Fork Join," that shows how you can use the latest Java 7 fork/join API in similar context.
Support new semantics that fits the dynamic web era
Clearly, SQL is a great query language (or else it wouldn't have taken over for database access!) but SQL also has its limits. When we're writing web content, we often deal with dynamic data structures that continuously evolve. When the amount of this data gets massive, as with Big Data, it becomes almost impossible to manage those changes through a standard SQL schema evolution model.
To address this need, BigData platforms need to add support schemaless semantics as a first class citizen. That often means that the data mapping layer mentioned earlier would need to be extended to support document semantics. An example for that is MongoDB, CouchBase, Cassandra and the new GigaSpaces Document API (which, as it turns out, maps to MongoDB and Cassandra very easily.)
Provide built-in semantics for handling of the tradeoffs between consistency, availability, scalability rather than trying to force a least common denominator as with XA and JTA
JEE application platform forced a fairly strict consistency model through JTA, Distributed Transactions, and so forth. Unlike Java EE, Big Data Application platforms need to support more relaxed versions of those semantics that would enable flexibility between consistency, scalability, and performance. A good example on how that configuration may look like is covered in Cassandra/DataStax documentation and one of my previous post on the subject of CAP theory.
Provide built in support for In-Memory data store to gain best performance and latency
RAM-based devices provide the best performance/latency results. Big Data platforms need to provide seamless integration between RAM and Disk based devices where data that is written in RAM would be synched into Disk asynchronously. In addition to that, they need to provide common abstractions that enable users to use the same Data access API for both devices and thus make it easier to choose to the right tool for the job without changing the application code.
Good starting points on that regard are the JPA abstractions now being supported by In-Memory Data-Grids and NoSQL data stores as well as the Spring Data abstraction as noted above.
Provide built-in support for event driven data distribution using pub/sub model
The current Java EE frameworks provide limited support for event processing through message-driven beans and JMS. For Big Data applications, we need to add data awareness to the current MDB model that make it easy to route messages based on data affinity and content of the message. We also need to provide more fine grained semantics for triggering events based on data operation (delete, update,..) as well as content as with CEP (Complex Event Processing).
In addition to that we need to provide higher level abstractions that use the underlying pub/sub support to enable data synchronization and locality. A good example for that is the use of LocalCache and LocalView where LocalCache is useful for random access patterns by caching the data that was used most recently and LocalView which provides means to Cache a specific segment of data that is known in advance. Both LocalCache and LocalView use the underlying pub/sub support to ensure continuous synchronization of the data from the BigData pool by tracking changes on the Big Data pool and updating the local copy of the data implicitly.
Built in support for public/private cloud
Big Data applications tend to consume lots of compute and storage resources. There are a growing number of cases where the use of the cloud enables significantly better economics for running Big Data applications. To take advantage of those economics, Big Data Application Platforms need to come with built-in support for public/private clouds that will include seamless transition between the various cloud platforms through integration with frameworks such as JClouds. Cloud Bursting provides a hybrid model for using cloud resources as spare capacity to handle load. To effectively handle Cloud Bursting with Big Data well have to make the data available for both the public and private side of the cloud under reasonable latency – which often requires other services such as data replication.
Open & Consistent management and orchestration across the stack
A typical Big Data application stack includes multiple layers such as the database itself, the web tier, the processing tier, caching layer, the data synchronization and distribution layer, reporting tools, etc. One of the biggest challenge is that each of those layers come with different management, provisioning, monitoring and troubleshooting tools. Big Data applications tend to be fairly complex by their very nature; the lack of consistent management, monitoring and orchestration across the stack makes the maintenance and management of this sort of application significantly harder.
In most of the existing Java EE management layers, the management application assumed control of the entire stack. With Big Data applications, that assumption doesn’t apply. The stack can vary quite significantly between different application layers therefore the management layer of Big Data Application Platform needs to come with a more open management that could host different databases, web-containers etc., and provide consistent management and monitoring through this entire stack.
Java EE application servers played an important role in making the development of database-centric web application closer to mainstream. Other frameworks such as Spring and Ruby on Rails later emerged to increase the development productivity of those applications. Big Data Application Platforms have a similar purpose – they should provide the framework for making the development, maintenance and management of Big Data Applications simpler. In a way, you could think of Big Data Application platforms as a natural evolution of the current application platforms.
With the current shift of Java EE application platforms toward PaaS, we're going to see even stronger demand for running Big Data applications on cloud based environments due to the inherent economic and operational benefits. Compared to the current PaaS model, moving data to the cloud is more complex and would require more advance support for data replication across sites, cloud bursting etc.
The good news is that Big Data Application platforms are being implemented with these goals in mind, and you can already see migration yielding the benefits one should expect.
- NoSQL Job Trends – August 2011 (Dzone)
- Big Data" technology: getting hotter, but still too hard (Forrester)
- Big Data Requires a Big, New Architecture (Forbes)
- Real Time analytics for Big Data: Facebook's New Realtime Analytics Syste
- Example for Big Data application using Cloudify