Real-time processing is becoming increasingly common in Big Data systems.
One of the most common frameworks for running real-time Big Data systems is Twitter Storm backed by a NoSQL database, as shown in the diagram below:
In this architecture, Storm is used for processing data as it comes in and the NoSQL data store is used as the output for that processing as well as for reference data storage.
This architecture presents a couple of challenges:
- Performance - Storm runs in-memory, and is therefore built to process large volumes of data at in-memory speed. However, a typical Storm architecture needs to interface with a data source for its input and a data store for its output. Overall performance is therefore determined not by Storm itself but by those input and output sources. Quite often these interfaces rely on file-based message queues for streaming and NoSQL for data storage. Both are at least an order of magnitude slower than Storm itself, and therefore become the limiting factor.
- Complexity - Storm itself consists of several moving parts, such as a coordinator (ZooKeeper), state manager (Nimbus), and processing nodes (Supervisor). The NoSQL data store also comes with its own cluster management. In addition, a typical Big Data system comes with more components such as the application front end, a reporting database, and more. This makes the process of managing the deployment, configuration, fail-over and scaling of such a system quite complex.
Meeting the Performance Challenge
Given that Storm itself runs in memory, it only makes sense that in order for Storm to run at maximum capacity the streaming and data store interfaces should be implemented in-memory as well, as shown below:
As you can see in the above architecture, we added two interfaces based on built-in Storm plug-ins - one for data input and one for data output. In both cases the underlying implementation is memory-based, removing the impedance mismatch of the previous architecture.
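To make the idea concrete, here is a minimal sketch (hypothetical names, not Storm's actual API) of what an in-memory streaming interface looks like: producers push events into a shared in-memory queue, and a consumer - standing in for a Storm spout - drains it at memory speed, with no file-based broker in between.

```python
import queue
import threading

# Shared in-memory queue standing in for the streaming interface.
events = queue.Queue(maxsize=10_000)

def produce(n):
    # Push n events into the in-memory queue.
    for i in range(n):
        events.put({"id": i, "payload": f"event-{i}"})

def consume(n):
    """Drain n events, much as a spout's nextTuple() loop would."""
    out = []
    for _ in range(n):
        out.append(events.get())
        events.task_done()
    return out

producer = threading.Thread(target=produce, args=(5,))
producer.start()
consumed = consume(5)
producer.join()
print([e["id"] for e in consumed])  # [0, 1, 2, 3, 4]
```

The point of the sketch is that the handoff between producer and consumer never touches disk, which is what removes the impedance mismatch described above.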
You can read the full details on how this integration works, including a code example, in DeWayne Filppi's post on integration of in-memory computing with Cassandra.
The second optimization that can be added is to the NoSQL data store.
As most analytics applications tend to be read-mostly, we can speed up access to the NoSQL data store with an in-memory local cache. This speeds up read performance by a factor of 10 to 1,000, as outlined by Shay Hassidim in his post on real-time Big Data performance.
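The pattern behind that speedup is a read-through cache: reads are served from local memory when possible and only fall through to the slower remote store on a miss. A minimal sketch (hypothetical names; a plain dict stands in for the remote NoSQL client):

```python
class ReadThroughCache:
    """Read-through local cache in front of a slower backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # e.g. a NoSQL client
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]       # fast: served from memory
        self.misses += 1
        value = self.backing_store.get(key)  # slow: remote read
        self.cache[key] = value              # populate local cache
        return value

# A dict stands in for the remote NoSQL store in this sketch.
store = {"user:1": {"name": "alice"}}
cached = ReadThroughCache(store)
cached.get("user:1")   # miss: goes to the store
cached.get("user:1")   # hit: served from memory
print(cached.hits, cached.misses)  # 1 1
```

In a read-mostly workload, nearly every access after the first lands in the fast path, which is where the order-of-magnitude gains come from.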
Meeting the Complexity Challenge
To meet the complexity challenge, we will take a DevOps automation approach, using Cloudify in conjunction with Chef or Puppet.
With this approach, we wrap every service with a deployment recipe that abstracts the underlying details of how to manage Storm, Cassandra, and our in-memory data store. Cloudify uses these recipes to automate the deployment of the entire stack. In this way, you interact only with Cloudify for the deployment, configuration, scaling, and fail-over of your stack rather than managing each component separately.

In addition, we use Cloudify to abstract the underlying infrastructure. This enables us to use the same deployment recipe across different environments - one for testing, another for production, and so on. We can also match the deployment target to the type of workload: a public cloud for sporadic workloads, where the elasticity of the cloud lets us create and scale the environment and then tear it down completely when done, and bare-metal machines for I/O-intensive workloads.
You can read more on how to set up Storm with Cloudify in DeWayne's post on Storm and the cloud.
Cloudify comes with a rich set of built-in recipes for other Big Data services as well, making the integration process an out-of-the-box experience. The main ones are listed below:
Final Notes - Optimizing without Code Change
Many real-time Big Data systems are now based on Twitter Storm and a NoSQL data store such as Cassandra.
In this post I tried to outline how we can optimize this architecture by addressing two areas: performance and management.
The good news is that all this is possible to achieve seamlessly, without any code change. Let me explain:
In the case of Storm, we used its built-in extension points - Spout for streaming and Trident for the data store interface. This lets us simply plug our memory-based implementations in under these two integration points, and all our existing Storm business logic works pretty much unchanged.
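The reason the business logic is untouched is that it depends only on a narrow interface, not on the transport behind it. A conceptual sketch in Python (Storm's real spout API is Java; the names here are illustrative only): the topology logic pulls tuples through a spout-like interface, so the backing can be swapped from a file-based queue to an in-memory buffer without changing the logic.

```python
import collections

class InMemorySpout:
    """Illustrative spout-like interface backed by an in-memory buffer."""

    def __init__(self, source):
        self._buffer = collections.deque(source)

    def next_tuple(self):
        # Return the next tuple, or None when the buffer is drained.
        return self._buffer.popleft() if self._buffer else None

def word_count(spout):
    """Business logic: depends only on the spout interface, not its backing."""
    counts = collections.Counter()
    while (t := spout.next_tuple()) is not None:
        counts[t] += 1
    return counts

spout = InMemorySpout(["storm", "cassandra", "storm"])
print(word_count(spout))  # Counter({'storm': 2, 'cassandra': 1})
```

Any other implementation exposing the same `next_tuple()` contract - for example one wrapping a message queue - could be passed to `word_count` unchanged, which is the property the Spout/Trident plug-in points give a real Storm topology.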
We use the same plug-in approach to integrate our in-memory data store with our NoSQL data store. This integration makes the data flow between real-time streaming and NoSQL storage fairly transparent. In addition, it allows us to plug in different NoSQL data stores such as MongoDB, Couchbase, etc., giving us another degree of flexibility.
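A common shape for this integration is write-behind: writes land in the in-memory store immediately and are flushed asynchronously, often in batches, to whichever NoSQL backend is plugged in. A hedged sketch (hypothetical names; a dict stands in for the NoSQL client, and flushing is triggered synchronously here for simplicity):

```python
class WriteBehindStore:
    """In-memory store that flushes writes to a pluggable backend in batches."""

    def __init__(self, backend, batch_size=2):
        self.memory = {}        # fast in-memory copy, always up to date
        self.backend = backend  # pluggable NoSQL client (a dict in this sketch)
        self.pending = []
        self.batch_size = batch_size

    def put(self, key, value):
        self.memory[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Slow remote writes happen here, amortized across a batch.
        for key, value in self.pending:
            self.backend[key] = value
        self.pending.clear()

backend = {}
store = WriteBehindStore(backend)
store.put("a", 1)       # buffered; not yet in the backend
store.put("b", 2)       # batch full: both writes flushed to the backend
print(sorted(backend))  # ['a', 'b']
```

Because the backend is just an object the store writes through, swapping MongoDB for Couchbase means swapping that one dependency, not changing the streaming logic.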
The same applies to our management layer: existing Storm and NoSQL deployments are wrapped in plug-in recipes and don't require any code changes.
Not Mutually Exclusive
We don't have to implement all of the optimizations to gain the benefit of this architecture. Each of the optimization points can be plugged in independently of the others. For example, if your main pain point is complexity, you can start by adding the DevOps automation alone. If your main pain point is performance, you can start with the memory-based plug-ins to speed up processing.
The other advantage of the combined architecture, which I haven't discussed in this post, is that it gives you more flexibility in the degrees of consistency and availability with which you configure your system, as I outlined in one of my recent posts on the subject, In Memory Computing (Data Grid) for Big Data.