In one of my earlier posts I discussed in general terms why it makes a lot of sense to put Big Data on the cloud.
The first step that we took in this regard with Cloudify was to make NoSQL databases, such as Cassandra and MongoDB, run on any the cloud through Cloudify recipes. Uri Cohen gave an excellent talk on this during the Cassandra Summit where he provided a insights into this work.
In the past couple of weeks we've been working on the next phase of this project: Putting Hadoop on the cloud.
As there are various solutions that aim to do something similar, I thought that a good start would be to first outline where we fit in this ecosystem.
Cloudifying Hadoop -- What Does That Actually Mean?
Yet another Hadoop distribution?
There are multiple distributions of the Hadoop framework today. All come with strong management and sets of tools and do a fairly good job. With that in mind, it was clear to us that our goal wouldn't be to come up with yet another Hadoop distribution but rather to integrate with the ones that are already out there. We picked IBM BigInsights and the Cloudera distribution as the first targets for the project.
When we realized that this was the path we wanted to pursue, the main question that immediately came up was:
What value can we add on top of IBM BigInsights and the Cloudera distribution?
Big Data systems tend to be complex to manage and operate. BigInsights and Cloudera provide good tools to make the process of configuring and deploying Hadoop significantly simpler. That's good for the Hadoop part of the story. Big Data systems and applications often include other services such as relational databases, other NoSQL databases such as Cassandra or MongoDB, stream processing such as Twitter-Storm and GigaSpaces XAP, web front ends such as Tomcat, Play framework and Node.js. Each framework comes with its own management, installation, configuration, and scaling solutions, as shown in the diagram below:
Managing each component of your Big Data system seperately is an operational nightmare. That complexity grows exponentially as the system gets bigger, and in Big Data that's just to be expected.
We realized that one of the areas in which we can reduce this operational complexity is through Consistent Management. By Consistent Management, I'm referring specifically to consistent deployment, configuration, and management across the stack. Consistent management applies not only to the deployment phase, but also to post-deployment, including fail-over, scaling, and upgrades. In addition, Big Data systems tend to consume a lot of infrastructure resources that can easily pile up to thousands of nodes. We realized that we can optimize the infrastructure cost for running the Big Data system through Cloud Enablement and Cloud Portability. Cloud Portabiltiy enables you to choose the right cloud for the job. For example, you could choose a bare-metal cloud for I/O intensive workloads or a virtualized/public cloud for more sporadic workloads. Below is a more detailed description on how we implemeted these two properties:
1. Consistent Management
With consistent management, we wanted to make the experience of managing each of the tiers and services in the Big Data System consistent throughout the entire stack. This is where Cloudify plugs in.
Cloudify already plugs into a variety of web containers, databases, and through the integration with Chef also into hundreds of services available through the Chef Cookbook.
The process for achieving consistent management for BigInsights and Cloudera Hadoop distribution basically maps to the creation of a Cloudify recipe. The purpose of the recipe in this specific case was to map all the capabilities that come with the IBM BigInsights and Cloudera distribution in a way that would be later consistent with other services in the stack.
For this, we needed to come up with the following:
- Deployment automation -- A Cloudify recipe enables us to automate the installation, configuration, and deployment of a given service. In the context of a Hadoop distribution, which comes with its own scripts for automating these phases, the process of creating a Cloudify recipe basically meant mapping the specific Hadoop distribution scripts into the Cloudify convention, as well as using the Cloudify discovery, global context, and other services to dynamically inject configuration values that are driven on-the-fly through the deployment process. In the case of BigInsights, this basically mapped into a simple recipe that would launch the machines, set up the SSH environment, and update the BigInsights configuration file with the relevant information. We then called BigInsights to install the NameNode, DataNode, Management, Hive and other services that come with the BigInsights distribution.
- Automation of post-deployment operations -- In the Hadoop scenario, post-deployment operations mapped into the abilities to add a node, rebalance, and test the environment. To automate these processes, we used Cloudify custom commands. Custom commands enable us to give an alias to those maintenance scripts, and then it enables the users to call those scripts using a simple naming convention without knowing the physical location of Hadoop deployment.
- SLA-based monitoring and auto-scaling -- SLA-based monitoring enables measure how the Hadoop distribution behaves after it's been deployed, and also to map specific actions when a given SLA is breached. For example, you can monitor a situation where a node fails and then spawns a new machine to take over that fail-node. You can also use the monitors to trigger any of the actions defined in the custom commands, which basically means that you can automate not only a fail-over process but also the ability to scale, by adding more capacity. To do this with Cloudify, use Custom Metrics to monitor the specific metrics that are of interest. One great power of custom metrics IMO is that you're not bound to metrics that are provided by the Hadoop distribution, and you can easily generate compound metrics that can be generated out of correlating metrics from multiple sources, such as such as network statistics or a cloud management system. You can then use the Cloudify Scaling Rules to attach rules to each of those metrics and trigger a fail-over or scaling event.
2. Cloud Enablement and Portability
The Cloud is basically a more efficient way to run IT. Today, there are growing numbers of cloud offerings and types of clouds, each providign different SLAs and pricing models. There are the public clouds such as Amazon, Rackspace, and also HP, IBM and Dell who's coming out with their own cloud offering. Microsoft and Google are also coming out with new offerings that will undoubtedly change the dynamic in this space. There are also the bare-metal clouds, the private cloud, the OpenSource cloud, etc., and obviously the VMware cloud, which are also an important part of the mix.
Enabling your Big Data systems for all these environments means you can leverage all the agility, efficiency, and flexibiltiy of the cloud, while also allowing you to choose the right cloud for the job.
Given the rapid dynamic of the cloud world, you also want to keep your options open so that you can leverage any future developments and services as they become available.
Controlling Your Cloud Infrastructure is a Critical Part of Controlling Your Big Data Costs
There are various ways to control Big Data costs through controlling infrastructure. Here are a few examples:
- Bare-Metal -- Reduce the number of machine instances for high I/O workloads.
- Private-Cloud --Optimize performance for your specific workload through specialized hardware (Inifiband, dedicated network, specialized hard drives, etc.) and also have better control over our data secrurity.
- Hybrid-Cloud -- Offload some of the work into the public cloud and optimize costs through more elastic computing.
Take Zynga for example. Zynga recently launched their own private cloud offering, zCloud, and is now using a hybrid cloud strategy in which they move some of the heavy workload from the Amazon cloud into their private cloud environment. With this approach Zynga was able to increase utilization by 3x, which means that they will need 1/3 of the servers that they would need from Amazon for the same workload, as noted by CTO Allan Leinwand on Zynga’s engineering blog:
For social games specifically, zCloud offers 3x the efficiency of standard public cloud infrastructure. For example, where our games in the public cloud would require three physical servers, zCloud only uses one. We worked on provisioning and automation tools to make zCloud even faster and easier to set up. We’ve optimized storage operations and networking throughput for social gaming. Systems that took us days to set up instead took minutes. zCloud became a sports car that’s finely tuned for games.
How Does Cloudify Add Cloud Portability to Your Hadoop Distribution?
Cloudify recipes uses a feature called machine templates. Machine templates are an abstraction of the underlying compute resource. The mapping of the machine template into the particular cloud infrastructure is done through the Cloud Driver.
Cloudify comes with a large and continuously growing list of Cloud Drivers available for HP, Rackspace, Amazon, Azure, and CloudStack, as well as for completely non-virtualized environments, which is basically just a bunch of bare-metal machines with an IP and network. The Cloud Driver also plugs in with JClouds and as such can plug into any cloud that is supported through the JClouds framework.
With all this, running your Hadoop deployment on any cloud becomes just a matter of a simple configuration of the target cloud end-point.
Current Status:
We've made the entire project available on GitHub. The technical description of that work is provided here. These days, we're working closely with our IBM partners to harden and optimize our BigInsights integration, and will be coming out with more updates shortly. Obviously, this is meant to be an ongoing process, so I'd really appreciate any feedback or comments or contributions.
In the next post we will outline more specifics about the work with IBM BigInsights.
Final Notes
Adding consistent management and cloud portability to your Big Data system allows you to reduce the operational and infrastructure cost of running Big Data systems. It's clear to me that this is just the begining for what we can achieve through this work. The flexibilty that we can add through the integration of Cloudify with IBM BigInsights and the Cloudera distribution will allow us to easily plug into other services and cloud infrastructures. It will also allow us to deploy and manage even the most complex systems through a single-command... So stay tuned.
References
- Big Data in the cloud with Cloudify
- Cloudify and IBM InfoSphere BigInsights
- The Rise of Bare metal Clouds
- Lessons from Zynga & Sony on moving from Amazon AWS
- Putting BigData/ Cassandra on the Cloud (Online session from the Cassandra Summit 2012)
- Cloudify Hadoop Recipies peoject on GitHub