I'm getting a lot of questions lately about the use of MapReduce: how it compares with other technologies such as grid computing, and how the different solutions that claim support for MapReduce (GigaSpaces included) fit into the puzzle. A good starting point is the intense discussion on the cloud computing mailing list under the topic "Is Map/Reduce going mainstream?", where I contributed some of my own thoughts on the topic.
To summarize the questions on this topic, I'd state it as follows:
How can we reduce the barrier-to-entry for implementing MapReduce specifically, and parallel processing applications in general?
Many of the use cases for MapReduce represent some sort of data analytic application. But can MapReduce be used as a generic parallel processing mechanism? Specifically, is it suited to deal with issues such as data affinity, asynchronous batch processing, etc.?
In this post I'll try to answer these questions, but first, a few clarifications:
What is MapReduce?
MapReduce is a programming model introduced by Google for processing and generating large data sets in parallel across a cluster of machines.
Why do we need a new model for processing large data sets?
Unlike central data-sources, such as a database, you can't assume that all the data resides in one central place, and therefore, you can't just execute a query and expect to get the result as a synchronous operation. Instead, you need to execute the query on each data-source, gather the results and perform a 2nd-level aggregation of all the results. To speed up this entire process, the query needs to be executed in parallel on each data source. The process of mapping a request from the originator to the data sources is called "Map"; and the process of aggregating the results into a consolidated result is called "Reduce".
MapReduce implementations
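Before looking at specific products, here is a minimal, single-process sketch of the pattern in plain Java. All names are hypothetical, and an ExecutorService stands in for the distributed runtime: each "partition" is queried in parallel (the Map), and the partial results are merged in a second-level aggregation (the Reduce).

```java
import java.util.*;
import java.util.concurrent.*;

public class MapReduceSketch {

    // "Map": execute the query against one partition; here the "query"
    // simply counts word occurrences in that partition's documents.
    static Map<String, Integer> mapPhase(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : partition)
            for (String word : doc.split("\\s+"))
                counts.merge(word, 1, Integer::sum);
        return counts;
    }

    // "Reduce": second-level aggregation of the per-partition results.
    static Map<String, Integer> reducePhase(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials)
            partial.forEach((word, count) -> merged.merge(word, count, Integer::sum));
        return merged;
    }

    public static Map<String, Integer> run(List<List<String>> partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (List<String> p : partitions)              // fan the query out in parallel
                futures.add(pool.submit(() -> mapPhase(p)));
            List<Map<String, Integer>> partials = new ArrayList<>();
            for (Future<Map<String, Integer>> f : futures) // gather the partial results
                partials.add(f.get());
            return reducePhase(partials);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real distributed implementation, of course, each map task runs on the node that holds the partition, and only the (much smaller) partial results travel over the network.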
Hadoop is the most well-known MapReduce implementation.
Hadoop is an open source project that implements, in Java, the exact spec defined by Google. As such, it was designed primarily to enable MapReduce operations on distributed file systems and was not really designed as a general purpose parallel processing mechanism.
Other forms of MapReduce implementations
Over time, the term MapReduce has expanded in definition to describe a more general purpose pattern for executing parallel aggregation of distributed data-sources, rather than referring to a specific type of implementation. GigaSpaces, GridGain, and to a degree, Terracotta, all took a different approach than Hadoop in their MapReduce implementations. Rather than implementing the exact Google spec in Java, these three aimed to take advantage of the Java language and make the implementation of the MapReduce pattern simpler for the average programmer (I'll get back to that later).
How does MapReduce differ from other grid implementations?
While MapReduce represents one form of parallel processing for aggregating data from distributed data sets, it is not the only one. "Compute Grid" is a term used to define another form of parallel processing, used mostly for compute-intensive batch processing. A typical batch process takes a long-running job, breaks it into small tasks and enables the execution of those tasks in parallel to reduce the time it takes to execute the job (compared with the time it would have taken to execute the tasks sequentially). This model is a good fit for executing relatively compute-intensive and stateless jobs. A typical scenario for this would be a Monte Carlo simulation, such as the one used to produce risk analysis reports in the financial industry. This type of analysis is more compute-intensive than data-intensive. Most compute-grid implementations have the following components:
The executor submits jobs. The scheduler is responsible for taking the job, splitting it into a set of small tasks (this process requires specific application code) that are sent in parallel based on a certain policy to a set of compute nodes. The agents on each compute node execute those tasks. The results of those tasks are aggregated back to the scheduler.
The scheduler is also responsible for monitoring and ensuring the execution of the tasks, and is typically designed to support advanced execution policies, such as priority-based execution, as well as advanced life-cycle management.
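As a rough single-process illustration of this scheduler/agent flow (plain Java, all names hypothetical, with an ExecutorService standing in for the compute nodes): a long-running job, here summing the numbers 1..n, is split into small tasks, dispatched in parallel, and the results are aggregated back.

```java
import java.util.*;
import java.util.concurrent.*;

public class ComputeGridSketch {
    // The "scheduler": splits a long-running job into small tasks,
    // dispatches them to the agents, and aggregates the results.
    public static long runJob(long n, int taskCount) throws Exception {
        ExecutorService agents = Executors.newFixedThreadPool(taskCount); // "compute nodes"
        try {
            long chunk = n / taskCount;
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                long from = i * chunk + 1;
                long to = (i == taskCount - 1) ? n : (i + 1) * chunk;
                results.add(agents.submit(() -> {        // one small task
                    long sum = 0;
                    for (long k = from; k <= to; k++) sum += k;
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> r : results) total += r.get(); // aggregate back
            return total;
        } finally {
            agents.shutdown();
        }
    }
}
```

The application-specific part, as noted above, is the splitting logic; the dispatch, monitoring and aggregation are what the grid product provides.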
Master/Worker Pattern (simple Compute Grid)
The Master/Worker pattern is a simplified version of parallel batch execution, based on the Tuple Space model. Tuple Spaces emerged from the Linda project at Yale University; JavaSpaces is the main Java implementation of the model. A good description of this model is provided in this article. In a master/worker pattern, tasks are assumed to be evenly distributed across worker machines, so there is no need for an intermediate scheduler. Load-balancing is achieved through a polling mechanism: each worker polls for tasks and executes them when it's ready. If a worker is busy, it simply won't take tasks; if it is free, it will poll the pending tasks and process them. Consequently, workers running on a more powerful machine will process more tasks over time. In this way, load balancing is implicit, supporting simple task distribution models. For this reason, master/worker implementations tend to be more useful for simple compute-grid applications. The fact that there is no need for an explicit scheduler makes master/worker more performant and better suited for cases where latency is an important factor.
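A minimal single-process sketch of this polling idea, with a BlockingQueue standing in for the tuple space (all names are hypothetical; this is not the JavaSpaces API): the master writes tasks into the shared space, and each worker polls and executes tasks whenever it is free, so load balancing falls out implicitly.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

public class MasterWorkerSketch {
    public static int process(int[] tasks, int workerCount) throws InterruptedException {
        BlockingQueue<Integer> space = new LinkedBlockingQueue<>(); // stands in for the tuple space
        for (int t : tasks) space.put(t);                           // the master writes tasks
        AtomicInteger result = new AtomicInteger();
        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                Integer task;
                // Each worker polls for a task whenever it is free;
                // a faster worker simply takes more tasks over time.
                while ((task = space.poll()) != null)
                    result.addAndGet(task * task);                  // "execute" the task
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        return result.get();
    }
}
```

Note that there is no scheduler anywhere in this sketch; the shared space and the workers' polling loop are the entire coordination mechanism.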
MapReduce & Compute Grid: Summary
Although both MapReduce and Compute Grids provide a parallel processing model for solving large-scale problems, each is optimized for a different kind of problem. MapReduce was designed to shorten the time it takes to process complex data-analytics scenarios. The results of the processing need to be returned in real-time, as the originator of the task normally blocks until its completion. Compute Grid applications are aimed at speeding up the time it takes to process complex computational tasks. The job is executed as a background process that can often run for a few hours. Users don't typically wait for the results of these tasks, but are either notified or poll for the results. With MapReduce, the application tends to be data-intensive, so scalability is driven mostly by the ability to scale the data through partitioning. Executing the tasks close to the data becomes critical in this scenario. Compute Grid applications tend to be stateless, and normally operate on relatively small data-sets (compared with those of MapReduce). Consequently, data affinity is considered an optimization rather than a necessity.
When to use MapReduce, Compute Grids and Master/Worker?
If you need to aggregate data that resides in a distributed file system, I would recommend the use of Hadoop and the like.
If you need to aggregate data that resides in other data sources, such as an in-memory data-grid (IMDG), you should consider GigaSpaces, or a combination of compute grid and data grid products.
If your application is compute-intensive and relatively stateless in nature, you should consider the classic Compute Grid implementations.
If you're looking for a real-time (or near-real-time) and lightweight compute-intensive application, you should consider Master/Worker implementations.
In reality, most compute-intensive applications are not purely stateless. The compute tasks need to process data that comes from either a database or a file system. In small-scale applications, it is common practice to distribute the data with the job itself. In large-scale compute-grid applications, however, passing the data with the job can be quite inefficient. In such cases, it is recommended to use a combination of Compute and Data Grid: the data is stored in a shared data-grid cluster and passed by reference to the compute task. So we see the need for a combination of Compute and Data Grids becoming more common.
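The "pass by reference" idea can be sketched as follows, with a ConcurrentHashMap standing in for the shared data-grid cluster (all names are hypothetical): the task carries only a key, not the payload, and the data is fetched from the grid only when and where the task actually runs.

```java
import java.util.concurrent.*;

public class DataGridSketch {
    // A shared map stands in for the data-grid cluster.
    static final ConcurrentMap<String, double[]> grid = new ConcurrentHashMap<>();

    // The task carries only a key (a reference), not the data itself.
    static class PricingTask implements Callable<Double> {
        private final String portfolioKey;               // reference only, no payload
        PricingTask(String key) { this.portfolioKey = key; }
        public Double call() {
            double[] positions = grid.get(portfolioKey); // fetched from the shared grid
            double total = 0;
            for (double p : positions) total += p;
            return total;
        }
    }

    public static double run() throws Exception {
        grid.put("portfolio-42", new double[]{100.0, 250.5, 49.5});
        ExecutorService compute = Executors.newSingleThreadExecutor();
        try {
            return compute.submit(new PricingTask("portfolio-42")).get();
        } finally {
            compute.shutdown();
        }
    }
}
```

In a real deployment the point is that the serialized task stays tiny no matter how large the data set grows, and with data-aware routing the task can be sent to the very node that holds the partition for its key.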
Too many options? Feeling confused?
At this point you may be scratching your head wondering whether or not your application falls precisely in any of the above categories.
A quick reality check will reveal that many existing applications consist of a variety of the above scenarios, mixed with traditional client-server models.
In such cases, attempting to use a different product for each scenario in our application is going to make things extremely complex.
How do we make distributed programming like MapReduce simple?
This question has been the driving force for many of our recent development efforts.
To simplify things, we realized that we need to:
- Grid-enable existing programming models -- Use abstraction and virtualization techniques to introduce parallel processing as part of a normal client/server programming model.
- Reduce the amount of frameworks -- Provide a common model for using both parallel computing models: batch (compute-intensive) and real-time aggregation (data-intensive).
- Make data-awareness implicit in all APIs -- In reality, most applications are stateful to some degree, so we need to make data-awareness implicit within our API and not an afterthought. External integration solutions tend to lead to complexity.
Where does GigaSpaces fit in?
GigaSpaces emerged from the tuple space model, specifically JavaSpaces, and was one of the first implementations of the Master/Worker pattern. At a later stage, we extended our JavaSpaces implementation to a full IMDG (In-Memory Data-Grid). In large scale compute grid applications, the GigaSpaces Data-Grid is often used in conjunction with other Compute-Grid implementations, either commercial or open source. This puts GigaSpaces in a unique place, providing data-grid and data-aware compute grid capabilities using the same architecture. We also provide built-in integration of our Data-Grid with more advanced Enterprise Compute Grid products, such as those from DataSynapse and Platform Computing.
As of version 6.0, we offer abstraction layers (referred to as the Service Virtualization Framework, or SVF) that take advantage of our existing space-based implementation in a way that doesn't require a complete rewrite or a steep learning curve for developers who have already written their business logic as SessionBeans, Spring Remoting, RMI, CORBA, SOAP and other common Client/Server programming models. Our aim was to make distributed programming simple for the average programmer. We achieved this goal by following the same principles that I laid out above. For example, we introduced a set of abstractions on top of our space-based implementation. As we support both data distribution and task distribution, we are able to reduce the number of required frameworks and runtime components, as well as avoid the need for external services to ensure data affinity. In addition, we extended our support for aggregated MapReduce queries using a new Executor framework. With this we can support MapReduce and batch processing using the Master/Worker pattern and the *same* consolidated programming model.
The idea behind all this is to make scale-out development simple by making the API as close as possible to prevailing programming models, and by reducing the number of products and components required to scale either data-intensive or compute-intensive applications.
The emergence of MapReduce specifically, and Grid computing in general, creates a need for a type of programming model currently missing from most mainstream frameworks and products. So far the solution has been to provide a different specialized framework to address each need. The fact that we have so many different frameworks (MapReduce included) makes things more complex.
On the Cloud Computing mailing list, Chris K Wensel wrote the following comment:
'thinking' in MapReduce sucks. If you've ever read "How to write parallel programs" by Nicholas Carriero and David Gelernter (http://www.lindaspaces.com/book/), many of their thought experiments and examples are based on a house building analogy. That is, how would you build a house in X model or Y model. These examples work because the models they present are straightforward... If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher level abstraction that users and developers can reason in.
Indeed, making MapReduce part of mainstream development requires a higher-level abstraction. That abstraction needs to provide a means to use existing programming models on top of MapReduce, to shorten the learning curve and ease the transition from existing applications to distributed scale-out applications. Having said that, this is not enough, as we're still going to end up with multiple frameworks for addressing the parallel programming models that are not covered by MapReduce, such as Compute Grids and batch processing. It is therefore critical to map those different models into a coherent and consistent model that supports all the various programming semantics, including MapReduce, Master/Worker and batch processing, in addition to the classic Client/Server model, with the ability to smoothly transition among them, without the need to switch or integrate different frameworks for each, and without the need to write our business logic in a completely different way for each.