I'm getting a lot of questions lately about the use of MapReduce: how it
compares with other technologies such as Grid, and how the different
solutions that claim support for MapReduce (GigaSpaces included) fit into the
puzzle. A good starting point is the intense
discussion on the cloud computing mailing list under the topic: "Is
Map/Reduce going mainstream?" where I contributed some of my own thoughts
on the topic.
To summarize the questions on this topic, I'd state it as follows:
How can we reduce the barrier to entry for implementing MapReduce specifically, and parallel processing applications in general?
Many of the
use cases for MapReduce represent some sort of data analytic application.
But can MapReduce be used as a generic parallel processing mechanism?
Specifically, is it suited to deal with issues such as data affinity,
asynchronous batch processing, etc.?
In this post I'll try to answer these questions, but first, a few
clarifications:
What is MapReduce?
Quoting from the Wikipedia definition: MapReduce is a software framework introduced by Google to support parallel computations over large (multiple-petabyte) data sets on clusters of computers.
Why do we need a new model for processing large data sets?
Unlike central
data-sources, such as a database, you can't assume that all the data resides in
one central place, and therefore, you can't just execute a query and expect to
get the result as a synchronous operation. Instead, you need to execute the
query on each data source, gather the results, and perform a second-level aggregation of all the results. To shorten the time it takes to run this entire process, the query needs to be executed in parallel on each data source. The process of mapping a request from the originator to the data sources is called "Map", and the process of aggregating the results into a consolidated result is called "Reduce".
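To make this flow concrete, here is a minimal sketch in plain Java (no framework; the partitions and the matching predicate are hypothetical stand-ins for real distributed data sources) that runs the same query against several partitions in parallel and then reduces the partial results into a single consolidated result:

    import java.util.*;
    import java.util.concurrent.*;

    public class SimpleMapReduce {

        // "Map": run the query against a single partition.
        static long countMatches(List<String> partition, String term) {
            return partition.stream().filter(r -> r.contains(term)).count();
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical data partitions, each standing in for a remote data source.
            List<List<String>> partitions = Arrays.asList(
                    Arrays.asList("alpha", "beta", "alpha beta"),
                    Arrays.asList("beta", "gamma"),
                    Arrays.asList("alpha", "delta"));

            ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
            List<Future<Long>> partials = new ArrayList<>();

            // Map phase: send the query to every partition in parallel.
            for (List<String> partition : partitions) {
                partials.add(pool.submit(() -> countMatches(partition, "alpha")));
            }

            // Reduce phase: aggregate the partial results into one consolidated result.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            pool.shutdown();

            System.out.println("Total matches for 'alpha': " + total);
        }
    }

A real MapReduce framework adds the parts this sketch glosses over: moving the computation to where the data lives, handling node failures, and scaling well beyond a single JVM.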
MapReduce implementations
Hadoop is the most well-known MapReduce implementation.
Hadoop is an open source project that implements, in Java, the exact spec defined by Google. As such, it was designed primarily to enable
MapReduce operations on distributed file systems and was not really designed as
a general purpose parallel processing mechanism.
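To give a feel for the programming model, the canonical word-count example against Hadoop's MapReduce API looks roughly like the sketch below (based on the standard Hadoop tutorial; package and class names follow the newer org.apache.hadoop.mapreduce API and may need adjusting for the Hadoop version you run):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note how the mapper and reducer are written against the distributed file system abstractions (Writable types, input splits, output paths) rather than against ordinary Java objects, which is part of what keeps the barrier to entry high.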
The Wikipedia entry on MapReduce (http://en.wikipedia.org/wiki/MapReduce) has references to other implementations in other languages, including Greenplum, Skynet, and Disco.
Other forms of MapReduce implementations
Over time, the term MapReduce has expanded in definition to describe a
more general purpose pattern for executing parallel aggregation of distributed
data-sources, rather than referring to a specific type of implementation.
GigaSpaces, GridGain, and to a degree Terracotta, all took a different approach from Hadoop in their MapReduce implementations. Rather than implementing the exact Google spec in Java, these three aimed to take advantage of the Java language and make the implementation of the MapReduce pattern simpler for the average programmer (I'll get back to that later).
How does MapReduce differ from other grid implementations?
Compute Grid
While MapReduce represents one form of parallel processing for
aggregating data from distributed data sets, it is not the only one. "Compute Grid" is a term used to describe another form of parallel processing, used mostly for compute-intensive batch processing. A typical batch process takes a long-running job, breaks it into small tasks, and enables those tasks to execute in parallel to reduce the time it takes to complete the job (compared with the time it would have taken to execute the tasks sequentially). This model is a good fit for executing relatively compute-intensive and stateless jobs. A typical scenario would be a Monte Carlo simulation, such as the ones used to produce risk analysis reports in the financial industry. This type of analysis is more compute-intensive than data-intensive. Most compute-grid implementations have the following components:
Scheduler
Job executor
Compute agent
The job executor submits jobs. The scheduler is responsible for taking each job and splitting it into a set of small tasks (this process requires specific application code), which are sent in parallel, based on a certain policy, to a set of compute nodes. The agents on each compute node execute those tasks, and the results are aggregated back through the scheduler.
The scheduler is responsible for monitoring and ensuring the execution of the tasks, and is typically designed to support advanced execution policies, such as priority-based execution, as well as advanced life-cycle management.
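To illustrate the job/task split, here is a small sketch in plain Java (the thread pool stands in for the scheduler and compute agents; a real compute grid would dispatch the tasks to remote nodes rather than local threads) that breaks a Monte Carlo estimation of pi into independent tasks and aggregates their results:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class MonteCarloJob {

        // One task: throw 'samples' random darts and count hits inside the unit circle.
        static Callable<Long> task(long samples) {
            return () -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                long hits = 0;
                for (long i = 0; i < samples; i++) {
                    double x = rnd.nextDouble(), y = rnd.nextDouble();
                    if (x * x + y * y <= 1.0) hits++;
                }
                return hits;
            };
        }

        public static void main(String[] args) throws Exception {
            int tasks = 8;                       // the "scheduler" splits the job into 8 tasks
            long samplesPerTask = 1_000_000;

            ExecutorService agents = Executors.newFixedThreadPool(tasks);
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(agents.submit(task(samplesPerTask)));  // dispatch to "compute agents"
            }

            long totalHits = 0;
            for (Future<Long> r : results) {
                totalHits += r.get();            // aggregate results back, as the scheduler would
            }
            agents.shutdown();

            double pi = 4.0 * totalHits / (tasks * samplesPerTask);
            System.out.println("Estimated pi = " + pi);
        }
    }

Because each task is stateless and carries no data with it, the job scales almost linearly with the number of agents, which is exactly the profile Compute Grids target.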
Master/Worker Pattern (simple Compute Grid)
The Master/Worker pattern is a simplified version of parallel batch
execution, based on the Tuple
Space model. Tuple Spaces emerged from the Linda project at Yale University. JavaSpaces is the main Java implementation of the model. A good
description of this model is provided in this
article. In a master/worker pattern, tasks are assumed to be evenly
distributed across worker machines. In this case there is no need for an
intermediate scheduler. Load-balancing is achieved through a polling mechanism.
Each worker polls for tasks and executes them when it is ready. If a worker is busy, it simply won't pick up new tasks; if it is free, it will poll the pending tasks and process them. Consequently, workers running on a more powerful machine will process more tasks over time. In this way, load balancing is implicit, supporting simple task-distribution models. For this reason, master/worker implementations tend to be more useful for simple compute-grid applications. The fact that there is no need for an explicit scheduler makes master/worker more performant and better suited for cases where latency is an important factor.
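As a rough illustration of the worker side of this pattern against the JavaSpaces API (net.jini.space.JavaSpace), here is a sketch of the polling loop; TaskEntry and ResultEntry are hypothetical entry classes, and the Jini lookup and transaction plumbing a real deployment needs is omitted:

    import net.jini.core.entry.Entry;
    import net.jini.space.JavaSpace;

    public class Worker implements Runnable {

        // Hypothetical task tuple written into the space by the master.
        public static class TaskEntry implements Entry {
            public Integer taskId;
            public String payload;
            public TaskEntry() {}   // Entry classes need a public no-arg constructor
        }

        // Hypothetical result tuple written back by a worker.
        public static class ResultEntry implements Entry {
            public Integer taskId;
            public String result;
            public ResultEntry() {}
        }

        private final JavaSpace space;

        public Worker(JavaSpace space) {
            this.space = space;
        }

        public void run() {
            TaskEntry template = new TaskEntry();   // null fields match any pending task
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Blocking take: a free worker pulls the next pending task;
                    // a busy worker simply isn't polling, so it takes no new work.
                    TaskEntry task = (TaskEntry) space.take(template, null, Long.MAX_VALUE);

                    ResultEntry result = new ResultEntry();
                    result.taskId = task.taskId;
                    result.result = task.payload.toUpperCase();   // stand-in for real work

                    // Write the result back so the master can collect it.
                    space.write(result, null, 60 * 1000);
                }
            } catch (Exception e) {
                Thread.currentThread().interrupt();
            }
        }
    }

The master simply writes TaskEntry tuples into the space and takes ResultEntry tuples back; the space itself provides the queueing and implicit load balancing described above.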
MapReduce & Compute Grid: Summary
Although both MapReduce and Compute Grids provide a parallel processing model for solving large-scale problems, each is optimized for addressing a different kind of problem. MapReduce was designed to shorten the time it takes to process complex data-analytics scenarios. The results of the processing need to be returned in real time, as the originator of the task normally blocks until its completion. Compute Grid applications are aimed at speeding up the time it takes to process complex computational tasks. The job is executed as a background process that can often run for a few hours. Users don't typically wait for the results of these tasks, but are either notified or poll for the results. With MapReduce, the application tends to be data-intensive, so scalability is driven mostly by the ability to scale the data through partitioning. Executing the tasks close to the data becomes critical in this scenario. Compute Grid applications tend to be stateless and normally operate on relatively small data sets (compared with those of MapReduce). Consequently, data affinity is considered an optimization rather than a necessity.
When to use MapReduce, Compute Grids and Master/Worker?
In reality, most compute-intensive applications are not purely stateless. To execute, the compute tasks need to process data coming from either a database or a file system. In small-scale applications, it is common practice to distribute the data with the job itself. In large-scale compute-grid applications, however, passing the data with the job can be quite inefficient. In such cases, it is recommended to use a combination of Compute and Data Grids: the data is stored in a shared data-grid cluster and passed by reference to the compute task. So we see the need for a combination of Compute and Data Grids becoming more common.
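The difference is easy to see in code. In the toy sketch below (the DataGrid interface and its get method are hypothetical stand-ins for any in-memory data grid client), the first task carries its data with it, while the second carries only a key and resolves the data against the shared grid on the node where it executes:

    import java.io.Serializable;
    import java.util.concurrent.Callable;

    // Hypothetical client interface for a shared in-memory data grid.
    interface DataGrid {
        <T> T get(String key);
    }

    // Passing the data by value: the whole portfolio travels with the task.
    class ByValueTask implements Callable<Double>, Serializable {
        private final double[] portfolio;            // potentially large payload
        ByValueTask(double[] portfolio) { this.portfolio = portfolio; }
        public Double call() {
            double sum = 0;
            for (double v : portfolio) sum += v;
            return sum;
        }
    }

    // Passing the data by reference: only a key travels; the task resolves it
    // against the shared data grid on the node where it runs.
    class ByReferenceTask implements Callable<Double>, Serializable {
        private final String portfolioKey;
        private transient DataGrid grid;              // injected on the executing node
        ByReferenceTask(String portfolioKey) { this.portfolioKey = portfolioKey; }
        void setGrid(DataGrid grid) { this.grid = grid; }
        public Double call() {
            double[] portfolio = grid.get(portfolioKey);
            double sum = 0;
            for (double v : portfolio) sum += v;
            return sum;
        }
    }

When the grid is partitioned and tasks are routed to the partition that holds their key, the by-reference version also gains data affinity: the data never crosses the network at all.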
Too many options? Feeling confused?
At this point you may be scratching your head, wondering whether your application falls precisely into any of the above categories.
A quick reality check will reveal that many existing applications
consist of a variety of the above scenarios, mixed with traditional
client-server models.
In such cases, attempting to use a different product for each scenario in your application is going to make things extremely complex.
How do we make distributed programming like MapReduce simple?
This question has been the driving force for many of our recent
development efforts.
To simplify things, we realized that we need to:
- Grid-enable existing programming models -- use abstraction and virtualization techniques to introduce parallel processing as part of a normal client/server programming model.
- Reduce the number of frameworks -- provide a common model for both parallel computing models: batch (compute-intensive) and real-time aggregation (data-intensive).
- Make data-awareness implicit in all APIs -- in reality, most applications are stateful to some degree, so we need to make data-awareness implicit within our API and not an afterthought. External integration solutions tend to lead to complexity.
Where does GigaSpaces fit in?
GigaSpaces emerged from the tuple space model, specifically
JavaSpaces, and was one of the first implementations of the Master/Worker
pattern. At a later stage, we extended our JavaSpaces implementation to a full
IMDG (In-Memory Data-Grid). In large scale compute grid applications, the
GigaSpaces Data-Grid is often used in conjunction with other Compute-Grid
implementations, either commercial or open source. This puts GigaSpaces in a unique place, providing data-grid and data-aware compute-grid capabilities
using the same architecture. We also provide built-in integration of our
Data-Grid with more advanced Enterprise Compute Grid products, such as those from
DataSynapse and Platform Computing.
As of version 6.0, we offer abstraction layers (referred to as the Service Virtualization Framework, or SVF) that take advantage of our existing space-based implementation in a way that doesn't require a complete rewrite or a steep learning curve for developers who have already written their business logic as Session Beans, Spring Remoting, RMI, CORBA, SOAP, and other common client/server programming models. Our aim was to make distributed programming simple for the average programmer. We achieved this goal by following the same principles that I laid
out above. For example, we introduced a set of abstractions on top of our
space-based implementation. As we support both data distribution and task
distribution, we are able to reduce the number of required frameworks and
runtime components, as well as avoid the need for external services to ensure
data affinity. In addition, we extended our support for aggregated MapReduce
queries using a new Executor
framework. With this we can support MapReduce and batch processing using
the Master/Worker pattern and the *same* consolidated programming model.
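As a rough illustration of what this looks like (a sketch loosely modeled on the OpenSpaces executor API; exact class names and signatures may vary between GigaSpaces versions, and CountTradesTask is a hypothetical example), a distributed task executes on each partition, close to the data, and a reduce phase consolidates the partial results on the client:

    import java.util.List;

    import org.openspaces.core.GigaSpace;
    import org.openspaces.core.executor.DistributedTask;

    import com.gigaspaces.async.AsyncFuture;
    import com.gigaspaces.async.AsyncResult;

    // Hypothetical task: each partition counts its local trades, the reduce phase sums them.
    public class CountTradesTask implements DistributedTask<Integer, Long> {

        // Runs on each partition, close to the data (the "map" side).
        public Integer execute() throws Exception {
            // Stand-in for a local query against the collocated partition.
            return 42;
        }

        // Runs on the client once all partitions have answered (the "reduce" side).
        public Long reduce(List<AsyncResult<Integer>> results) throws Exception {
            long total = 0;
            for (AsyncResult<Integer> result : results) {
                if (result.getException() != null) throw result.getException();
                total += result.getResult();
            }
            return total;
        }
    }

Submitting the task is then a single call against the space proxy, for example: AsyncFuture<Long> future = gigaSpace.execute(new CountTradesTask()); followed by future.get(). The same space and the same task abstraction can also back a Master/Worker-style batch job, which is what lets one programming model cover both cases.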
The idea behind all this is to make scale-out development
simple by making the API as close as possible to prevailing programming models,
and by reducing the number of products and components required to scale either
data-intensive or compute-intensive applications.
Final notes
The emergence of MapReduce specifically, and Grid computing in
general, creates a need for another type of programming model currently missing
in most existing mainstream frameworks and products. So far the solution has
been to provide different specialized frameworks to address each need. The
fact that we have so many different frameworks (MapReduce included) makes
things more complex.
On
the Cloud Computing mailing list, Chris K Wensel wrote the following comment:
'thinking' in MapReduce sucks. If you've ever read "How to
write parallel programs" by Nicholas Carriero and David Gelernter (http://www.lindaspaces.com/book/),
many of their thought experiments and examples are based on a house building
analogy. That is, how would you build a house in X model or Y model. These
examples work because the models they present are straightforward... If
companies like Greenplum are using MapReduce as an underlying compute model,
they must offer up a higher level abstraction that users and developers can
reason in.
Indeed. Making MapReduce part of mainstream development requires a higher-level abstraction. That abstraction needs to provide a means of using existing programming models on top of MapReduce, shortening the learning curve and easing the transition from existing applications to distributed scale-out applications.
Having said that, this is not enough, as we're still going to end up with multiple frameworks for addressing the various parallel programming models that are not covered by MapReduce, such as Compute Grids and batch processing. It is therefore critical to map those different models into a coherent and consistent model that supports the various programming semantics, including MapReduce, Master/Worker, and batch processing, in addition to the classic client/server model, with the ability to smoothly transition among them, without the need to switch or integrate different frameworks for each, and without the need to write our business logic in a completely different way for each.