Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing, which clouds can readily provide.
Edd touched briefly on the role of PaaS for delivering Big Data applications in the cloud:
Beyond IaaS, several cloud services provide application layer support for big data work. Sometimes referred to as managed solutions, or platform as a service (PaaS), these services remove the need to configure or scale things such as databases or MapReduce, reducing your workload and maintenance burden. Additionally, PaaS providers can realize great efficiencies by hosting at the application level, and pass those savings on to the customer.
Even though Edd’s article covers all the different forms of running Big Data on private and public clouds, it focuses mainly on the public cloud offerings from Amazon, Microsoft and Google.
In this post, I wanted to cover more specifically how I see the evolution of cloud application platforms (PaaS) to support Big Data. I’ll refer specifically to Cloudify which was designed primarily to support Big Data applications.
Big Data in the cloud using Cloudify
Most of the PaaS solutions out there started by focusing on simple web application deployments on Ruby, Java and Node.js. When we designed Cloudify, by contrast, we picked Big Data as the primary target and started by supporting popular NoSQL clusters such as Cassandra and MongoDB, as well as offering the equivalent of Amazon RDS through recipes for MySQL. Our goal was to make Big Data deployments a first-class citizen within Cloudify. To this end, when you download Cloudify you'll notice that ALL of our examples come with pre-integrated Big Data deployments.
There are a couple of reasons that led us to make that decision:
Managing large data clusters is a core expertise at GigaSpaces
Most people know GigaSpaces for our In-Memory Data Grid solution known as XAP (eXtreme Application Platform). Over the past 10 years, as our customer deployments grew substantially, we realized that developing strong automation and cluster management is as critical as handling data consistency, performance, and latency in our data-grid product. In a large cluster, if something breaks it is practically impossible to handle that failure through manual procedures. For that reason, we developed lots of IP around automation of data cluster deployment, which resulted in a unique self-managed data cluster.
Cloudify is a natural evolution of GigaSpaces Data Cluster
When we built Cloudify it made a lot of sense to take the IP that we developed for managing GigaSpaces clusters and generalize it so it would fit any other framework. In this way, we could leverage our years of experience and development in this area, and gain a significant head start.
Big Data applications are complex
Big Data applications tend to be fairly complex, which makes them an ideal candidate for the sort of automation and management that Cloudify can offer.
Big Data applications have a lot in common with XAP applications
Both need automation of data, failover and recovery, both fit into large cluster deployments, and both share similar partitioning and other clustering architecture.
What makes a Big Data platform different from any other application
Most of the existing orchestration systems were designed to handle stateless processes. Moving data is a completely different ballgame as you need to think of:
Primary and Backup dependency
Availability - moving data without losing it.
Moving processes to the data rather than the other way around.
Data replication within and across sites
Automating any of these processes through general orchestration tooling like Chef or RightScale can become a fairly involved and complex process, with lots of pitfalls in handling edge scenarios. One example is the handshake process that is often involved when automating recovery from a data node failure, including a split-brain scenario.
In Cloudify we were able to hide much of that logic from the user. For example, Cloudify will automatically ensure that primaries and backups won’t run on the same node, or in the same data center when planning for disaster recovery. You don't need to do anything but tag your cluster nodes with a zone tag.
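To make the zone-tag idea concrete, here is a minimal Python sketch of zone-aware placement. It is not Cloudify's implementation: the function name, the node and zone names, and the least-loaded heuristic are all assumptions chosen for illustration. The only rule it enforces is the one described above, that a partition's primary and backup never land in the same zone.

```python
from collections import defaultdict


def place_replicas(partitions, nodes):
    """Pick a primary and a backup node for each partition so that
    the two never share a zone tag.

    partitions: iterable of partition names (hypothetical labels)
    nodes: dict mapping node name -> zone tag
    """
    if len(set(nodes.values())) < 2:
        raise ValueError("need nodes in at least two zones")

    load = defaultdict(int)  # number of replicas already placed per node

    def least_loaded(candidates):
        # Ties are broken by node name so the result is deterministic.
        return min(candidates, key=lambda n: (load[n], n))

    placement = {}
    for partition in partitions:
        primary = least_loaded(nodes)
        load[primary] += 1
        # Only nodes outside the primary's zone are eligible as backup.
        other_zone = [n for n in nodes if nodes[n] != nodes[primary]]
        backup = least_loaded(other_zone)
        load[backup] += 1
        placement[partition] = (primary, backup)
    return placement


# Hypothetical cluster: four nodes tagged with two zones.
nodes = {"n1": "zone-a", "n2": "zone-a", "n3": "zone-b", "n4": "zone-b"}
placement = place_replicas(["p0", "p1", "p2", "p3"], nodes)
for partition, (primary, backup) in placement.items():
    print(partition, "primary:", primary, "backup:", backup)
```

The point of the sketch is that the user supplies only the zone tags; the placement rule itself lives in the platform rather than in every deployment script.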
Managing Big Data applications != Managing Big Data storage
Managing data clusters is one thing. Being able to process the data is yet another challenge that we need to think about when we’re dealing with application platforms, as I noted in one of my earlier posts.
The main challenge is that quite often the data processing logic is managed with completely different scaling, availability and monitoring tools than the ones used for managing the Big Data deployment. This silo thinking leads to a whole set of complexities, starting with the inconsistency of having multiple managers, each deciding in its own way when there is a failure or scaling event, and quite often ending up in conflict with one another. Having lots of moving parts is yet another challenge that makes the entire deployment pretty much a complete mess.
Built-in support for In Memory Stream based processing
Over the next few years, we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.
Because it is also part of XAP, Cloudify already comes with built-in support for streaming Big Data processing. This means that building your own Facebook or Twitter-like real-time analytics can be as simple as writing a few small scripts that handle the analytics counters. All the rest, i.e. scalability, availability, automation, cloud portability, management and monitoring, is covered by Cloudify as noted in this and this use case.
Examples for Big Data applications running on Cloudify
In the list below, I tried to put together a few references and examples that will make it easier for you to get started. The first reference points to a simpler scenario that will allow you to use Cloudify to deploy your Big Data database as a service. The other three references are full application stack deployments that include the data-processing and web tiers of applications managed together with the Big Data database.
Running a Big Data 'Database as a Service'
Cloudify comes with built-in recipes for Cassandra and MongoDB, as well as Solr (popular search engine), which makes it easy to deploy these database clusters on your local machine, data center or private/public cloud through a single command. In this way, you can use Cloudify to automate the database.
Spring Travel application with Cassandra
Demonstrates a deployment of a Java-based commerce application with Cassandra as the database.
The example includes recipes that provision the Cassandra database, create a schema, load the data, then spawn a Tomcat container, automatically injecting the reference to Cassandra, and then show the custom management and monitoring of the application, all through a single command. A recent videocast showing how the travel application works on HP OpenStack cloud is available here.
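The ordering in that description can be sketched as a small dependency graph. The following Python snippet is an illustration only, not the Cloudify recipe format: the step names are hypothetical labels, and the dependencies are my reading of the sequence above. It uses the standard library's graphlib module (Python 3.9 or later) to derive an order in which every step runs only after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
# Step names are hypothetical labels, not Cloudify recipe keywords.
steps = {
    "provision_cassandra": set(),
    "create_schema": {"provision_cassandra"},
    "load_data": {"create_schema"},
    "start_tomcat": {"load_data"},
    "inject_cassandra_reference": {"provision_cassandra", "start_tomcat"},
}


def deployment_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = deployment_order(steps)
print(" -> ".join(order))
```

Describing the deployment as dependencies rather than as a fixed script is what lets a single command drive the whole stack: the platform works out the order and can redo individual steps when a node fails.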
Pet Clinic example with MongoDB
The Pet Clinic example does pretty much the same thing only using a sharded MongoDB cluster.
Twitter Real Time analytics example for Big Data
The Twitter example shows how you can attach real time stream-based processing for handling real live Twitter feeds, and how you can manage both the stream processing cluster and the Big Data cluster using Cloudify and run it on any cloud. The entire source code for this example is provided on Github.
Give it a try
To try out any of the examples you'll need to download the Cloudify (latest) or (stable) build. Cloudify comes with the first three examples as part of the distribution, under the recipes or examples directory. To try out these examples, simply follow the steps from the Cloudify Quick Start Guide.
Earlier this month Zynga announced its move from Amazon AWS to its own private Z-Cloud. Sony also started to move increasing parts of its workload from Amazon to Rackspace OpenStack.
There isn't so much in common between these different use cases, except for the fact that they may indicate the beginning of a trend (I’ll get back to that toward the end) where companies start to take more control over their cloud infrastructure.
So what really brought Zynga and Sony to make such a move?
Zynga moves from Amazon to their own private cloud
Zynga ran their gaming services on Amazon for a while. It was noted that running these games on Amazon cost Zynga $63M annually. This cost, and the continuous increase of their workload, forced Zynga’s management to look for ways to control their cloud operational costs. Zynga realized that at that scale, they were better off building their own cloud infrastructure. By doing so, they could also optimize the cloud infrastructure for their own needs and reduce their operational costs substantially as a result.
According to Zynga, their private cloud increases utilization by 3x, which means that they need one third of the servers that they would need from Amazon for the same workload, as noted by CTO Allan Leinwand on Zynga’s engineering blog:
For social games specifically, zCloud offers 3x the efficiency of standard public cloud infrastructure. For example, where our games in the public cloud would require three physical servers, zCloud only uses one. We worked on provisioning and automation tools to make zCloud even faster and easier to set up. We’ve optimized storage operations and networking throughput for social gaming. Systems that took us days to set up instead took minutes. zCloud became a sports car that’s finely tuned for games.
What's interesting in this case is the cost analysis. Quite often when we measure the cost of running machines, we don't measure the cost of running a particular workload, which is a combination of many factors, not just the cost of servers.
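A back-of-the-envelope calculation shows why the per-workload view matters. In the Python sketch below, only the 3x efficiency figure comes from Zynga's statement; the server counts, per-server prices and overhead are made-up numbers for illustration, not Zynga's or Amazon's actual costs.

```python
import math


def servers_needed(baseline_servers, efficiency_gain):
    """Servers required for the same workload after an efficiency gain."""
    return math.ceil(baseline_servers / efficiency_gain)


def workload_cost(servers, cost_per_server, fixed_overhead=0):
    """Monthly cost of a workload: per-server cost plus fixed overhead."""
    return servers * cost_per_server + fixed_overhead


# Hypothetical workload that needs 300 servers in a public cloud.
public_servers = 300
private_servers = servers_needed(public_servers, efficiency_gain=3)

# Hypothetical prices: the private server is dearer per unit and carries
# a fixed operations overhead, yet the workload as a whole is cheaper.
public_cost = workload_cost(public_servers, cost_per_server=100)
private_cost = workload_cost(private_servers, cost_per_server=150,
                             fixed_overhead=5000)

print("public cloud:", public_servers, "servers, cost", public_cost)
print("private cloud:", private_servers, "servers, cost", private_cost)
```

With these assumed numbers the private deployment costs more per server but less per workload, which is the distinction the paragraph above draws.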
The other thing that is interesting with Zynga’s move is that they didn't make the move just to cut costs. The move was part of a more strategic direction to create a gaming platform for additional games. Apparently when your service becomes a platform, controlling your infrastructure becomes a strategic matter, not just a cost factor. It’s a big differentiator that can put Zynga in a completely different ball game than its competitors; it will also enable them to better control their dependency on Facebook and build their own independent ecosystem.
According to Dave Wehner, CFO at Zynga, the company will lower its cost of revenue over the next 18 to 24 months as third-party hosting costs decrease. Wehner said that Zynga plans to “roll off the majority of those third-party hosting arrangements.” Zynga’s capital expenditures in the fourth quarter were $50 million, down from $63 million in the third quarter. Most of that spending is focused on zCloud. For 2011, capital investments were $238 million.
The building of its own infrastructure will help the bottom line. Zynga can depreciate its gear and lower quarterly expenses. “We believe this investment will have a short payback period and enable us to expand gross margins in the long term,” said Wehner.
The important thing to note is that Zynga’s move wasn't caused by a failure on Amazon’s part, as it might first appear. It’s more of a natural evolution and maturity cycle of the company.
Sony moves from Amazon to Rackspace/OpenStack
Sony’s gaming arm faced a number of security breaches last year, as reported by NetworkWorld, which compromised the personal identity information of millions of players within Sony’s gaming network. This event forced Sony to look for ways to have better control of their infrastructure exposure. Their approach was to move some of their workload to Rackspace’s OpenStack.
By splitting their operations between Amazon and Rackspace, they can better contain the impact of a failure in any one cloud. They are also better positioned to control their cloud infrastructure costs by reducing their dependency on a particular cloud provider, which puts them in a stronger position to negotiate their cloud operational costs.
We’re often taught that infrastructure is a commodity that we shouldn't care much about, and that we should essentially outsource everything. It turns out that as the cost of infrastructure gets bigger, and as our needs become more unique, controlling the infrastructure becomes a critical part of our business. Controlling your infrastructure could mean having your own private cloud, as is the case with Zynga, or minimizing your dependency on a particular cloud provider, as is the case with Sony. The good news is that there are enough free OSS tools out there, such as Chef, Puppet, and Cloudify, that can help reduce the complexity that is often involved with such a move.