Architecture

April 14, 2009

Designing a Scalable Twitter

Guy Nirpaz, Uri Cohen and Shay Banon came up with an interesting exercise as part of the recent partner training that took place at the GigaSpaces office. In this exercise, the students were asked to come up with a scalable design for Twitter, using Space-Based Architecture.

There are some interesting scalability lessons from this exercise, which are applicable to anyone looking to implement new-style real-time web applications such as the ones used for social networking.

In this post I'll  try to summarize the main patterns to put into place and considerations to make when designing such a scalable architecture.

Background:

For those of you who are not yet familiar with the service, Twitter is sort of a SMS-service meets discussion board.  You can post short messages (up to 140 characters) that can be shared with a group of subscribers that are referred to as "followers". The main difference between twitter and other messaging applications is that both SMS and Instant Messaging (IM) applications were designed primarily for one-on-one communications whereis Twitter was designed primarily for broadcast communications (publish/subscribe, or pub/sub). Another aspect that is special about Twitter is that by default anyone can follow anyone else. In other words, it was designed for open communications, not private, as were IM and SMS.

What are Twitter's scalability challenges?

1. Sending a tweet (a message on Twitter is known as a 'tweet') -– The challenge is how to handle an ever-growing volume of tweets and re-tweets and responses that can lead to a viral "message storm"

2. Reading tweets – The challenge is how to handle a large number of concurrent users that continually “listen” for tweets from users (or topics) they follow.

Designing A Scalable Twitter

Choosing the right scalability patterns

Almost every challenge in software architecture has its roots in one of the existing patterns. So the simplest course is to start by looking for those patterns, and choosing the right patterns to scale the application. Looking at many other scalable architectures, we'll begin with a partitioning pattern as the core design principle. By partitioning our Twitter-like application we'll spread the load across a cluster of servers and scale by simply adding more servers (i.e., partitions).  Another important architectural observation about Twitter is that it doesn’t fit into the classic database-centric design that most web applications do. On the flip side, it doesn’t fit well with a messaging-centric design (pub/sub) either. It is a combination of the two.

A pattern that is suitable for this type of collaborative messaging is known as a blackboard pattern.  In our design, we will use those two design patterns -- partitioning and blackboard -- as the foundation for our scalable Twitter application. With the foundation in place, let’s list the requirements and examine how these patterns can be used to scale the app.

Scalability Requirements

We'll assume a relatively extreme scaling requirement:

  • Tweet Volume: 10 billion tweets per day
  • Tweet Storage: 100 Gigabytes per day (with 10:1 compression)

Additional assumptions:

  • Tweets are limited to 140 characters
  • Tweets are immutable, i.e., there are no updates, only inserts
  • Twitter limits client applications to 70 requests per hour

Now that we have the foundational patterns and clear requirements, we can design the architecture. We'll start first with the blackboard system.

Using an In-Memory Data Grid (IMDG) as a Blackboard System

The are several approaches to building a blackboard system. To maximize performance and scalability, we'll store the data in memory, thus avoiding disk I/O, which is often the main cause for contention. For years, Java has provided a model for designing blackboard systems known as JavaSpaces. More recently, distributed caching has become popular and can provide similar capabilities to those of JavaSpaces. Let's examine two popular distributed caching approaches for our blackboard system:

  1. Simple read-mostly caching using memcached
  2. Read/write caching, also known as an In-Memory Data Grid (IMDG)

Choosing between memcached and an IMDG

Memcached enables us to to store the data (tweets) in a distributed memory set and read it in a scalable fashion. Having said that, be aware that memcached is not transactionally-safe and is not designed for reliability (i.e., it doesn’t support fail-over and high availability). That means that if we use memcached or something similar, we will have to use a database as the back-end. Every tweet posted will have to be written to both memcached and the database in a synchronous fashion to ensure that no tweet will be lost. This approach may be good enough for scaling read access, however, for writes and updates it offers limited scalability.

Unlike memcached, which was designed for simple read-mostly caching, In-Memory Data Grids  are designed for handling a read/write scenario, and can therefore act as the system-of-record for both write and read operations. We can still use a database for long-term persistence, but because the IMDG maintains its reliability purely in memory, we can write and update the database asynchronously and avoid hitting the database bottleneck.

Todd Hoff, author of highscalability.com wrote an interesting summary that covers the different products in this space in a recent post:  Are Cloud Based Memory Architectures the Next Big Thing?

Todd provide a clear explanation of how an IMDG works (using GigaSpaces):


Nati blog 1 (2)
Natiblog 2 (2)  

  • A POJO (Plain Old Java Object) is written through a proxy using a hash-based data routing mechanism to be stored in a partition on a Processing Unit. Attributes of the object are used as a key. This is straightforward hash based partitioning like you would use with memcached.
  • You are operating through GigaSpace's framework/container so they can automatically handle things like messaging, sending change events, replication, failover, master-worker pattern, map-reduce, transactions, parallel processing, parallel query processing, and write-behind to databases.
  • Scaling is accomplished by dividing your objects into more partitions and assigning the partitions to Processing Unit instances which run on nodes-- a scale-out strategy. Objects are kept in RAM and the objects contain both state and behavior. A Service Grid component supports the dynamic creation and termination of Processing Units.

Back to our Twitter app: Given the scalability requirements, we will need to scale both reads and writes, and therefore, an IMDG is a more suitable approach to implementing the blackboard system.

Now let’s examine how the use of an IMDG as the blackboard system enables us to scale both sending and reading tweets. Let's start by designing the partitioned cluster.

Designing a partition architecture

One of the main considerations in designing a partition cluster of any kind is determining the partition key, such as a Customer ID in a CRM application or a Trade ID in a trading application. At first glance, it sounds like a trivial decision, but choosing the right partitioning key requires a deep understanding of the application usage patterns and data model.  In the case of Twitter, we could choose to partition the application by the data-type, the user, the tweet itself or the followers. Our first goal is selecting a key that will that will be granular enough to enable scaling the application just by adding more partitions, while making sure that we don't end up with a key that is too fine-grained -- making it sub-optimal for querying purposes.

If we use the timestamp key, for example, our application will be optimized for “inserts” (writes), however, even a simple query such “retrieve the tweets of a certain user” will force us to execute an aggregated query against all partitions. Alternatively, if we partition the data based on user-id, we'll be able to easily spread the load from different users across partitions. Retrieving the tweets of a certain user is going to be resolved in one call to a single partition. We may encounter a problem if a single user generates a significant higher load than average, however, in the case of Twitter, we can assume that this is not very likely. Partitioning by user-id is a good compromise.

Data capacity analysis

With such extreme requirements it is clear that storing all tweets in memory is going to require huge memory capacity. Very quickly this will become economically prohibitive, so we need to devise a scheme in which the IMDG acts as a buffer for most of the load on the system, and then offloads the data and queries to an underlying persistent storage.  In our Twitter example, it is fair to assume that most real-time queries (those that require fast access to the data) will be resolved in data from the last hour or 24 hours. Queries that require older data will need to hit the database for the initial call. However, subsequent access to fetch new updates should be resolved purely in-memory.

Using this approach, we'll need about 10 servers, each holding 10GB of data in memory to accommodate 24 hours of activity. If we also want to back up the data in memory, we will need double the amount of servers.

Choosing the right eviction policy

It's reasonable to assume that recent data is accessed most and older data is rarely used. To ensure that we get the maximum hit ratio on our memory front-end, let's choose a time-based eviction policy, which always holds the most recent updates in memory. When we will reach our memory capacity limit the oldest data will automatically get evicted from memory. The actual window of time in which we will be able to keep in memory is obviously dependent on the size of the cluster. With an IMDG implementation all tweets are stored in a persistent storage, which means that when tweets are evicted they are not deleted from the system.

Scaling tweet writes:

If we select user-id as the partitioning key, each user tweet will be sent to a specific partition. Multiple users may be routed to the same partition. Usually the algorithm to determine which partition fits a certain user is something like:

routing-key.hashCode() % #of partitions

In GigaSpaces, this is done by marking the routing attribute of our tweet class with an @SpaceRouting annotation.

The web front-end application will call space.write( new Tweet(..),..)  to send the tweets. This way there is nothing in our web client code that exposes the fact that the underlying implementation interacts with a cluster of partitions (spaces in GigaSpaces). Those details are abstracted within the space proxy. When the write method is called on the space proxy it parses the field that matches @SpaceRouting from our Tweet() object and uses this field value to calculate the partition it belongs to. It then uses that value to route the Tweet(..) object to the appropriate partition.

With this approach, the web application can be written in a very simple way and can interact with the entire cluster as if it was a single server.

Natiblog 3

The data from the memory partitions gets stored asynchronously into a persistent storage. The persistent storage could be a database, but it could also be other things, such as an index search engine based on Compass/Lucene.

Scaling tweet reads:

To those familiar with messaging system, at first glance Twitter looks like a classic publish subscribe application. A closer look, however, reveals that any attempt to implement Twitter with something like a JMS message queue is going to fail in achieving a scalable system. This is especially true if you consider that the system needs to maintain a durable queue for each user. That could easily lead to a scenario in which each tweet is published to thousands of subscribers and every re-tweet can potentially lead to a "message storm".

As I discuss above, the right way to think about this type of application is as a blackboard pattern, just as a blackboard (or these days, a whiteboard) is used by a group of people (followers, in the case of Twitter) to share information and collaborate. When someone writes something on the board, everyone sees it and can choose to react. Unlike messaging (take email for example), we don’t need to send separate messages to each subscriber. Instead everyone is looking at the same board. Everything is also copied from the board to paper. When the board runs out of space, we erase it. And we can always page through the paper copy to access the board history. 

In Twitter, this means that each follower that follows a group of people is basically polling for messages posted by those users from the last time he read them. To make things more tangible we can express this type of query with the following SQL syntax:

SELECT * FROM Post WHERE UserID=<id> AND PostedOn > <from date>.

The <from date> will normally be the last few minutes, if we're constantly looking for new messages.

But there's a caveat. Remember that we partitioned the application by user-id? This means that each user's tweets are stored in a separate partition. How can we read all users' posts? If we poll for each user individually, we will end up with a lot of network calls. The simplest approach would be to execute one call that looks for ALL the users we're following and look for updates (new tweets) from those users. The pattern we'll use to perform such this task is mapreduce. One way to do that with GigaSpaces is through the distributed task API:

Nati blog 4

The distributed task API is a modern version of the stored procedure. The following snippet shows what such a call would look like:

AsyncFuture<Long> future = gigaSpace.execute(new GetTweetsUpdates());
long result = future.get(); // result will be the number of primary spaces

The GetTweetsUpdates() class contains code that will be injected in each partition and will enable us to look for updates from the users we follow in a single call. Because the call runs in-process, and because the data is stored in-memory, executing such a task is extremely fast compared with the equivalent with database and stored procedure operations. Execution is aggregated to the caller implicitly. The caller can use a reducer to aggregate the results into a single result object.

Scaling the web front-end

Nothing really new here. We'll use a classic web front-end, which is comprised of a load-balancer and a cluster of web servers that act as a front end to our IMDG instances. The web application will use a single cluster-aware IMDG proxy to send new tweet posts. The IMDG proxy will be responsible for mapping the tweet with the actual partition that hosting the tweet. That logic is kept completely out of the application code. This allows us to keep our web tier clean and simple.

Keeping the web layer stateless to avoid session stickiness

One common pattern for keeping the web tier scalable is to use a Shared-Nothing Architecture, which basically means that the web tier will be stateless. This requires keeping the user session state external to the web-tier. As previously demonstrated, the IMDG can be used as high-performance, scalable data store for maintaining shared session state information. This allows us to avoid session stickiness and to scale the web tier without being locked in to a specific server throughout the entire session, in case the server is over-loaded.

For more information on how to scale the web tier, as well as other important capabilities such as self-healing and auto-scaling, see the following tutorial: Scaling Your Web Application.

Making it simple and cost-effective using cloud computing

Twitter is yet another example for a situation in which system load is highly variable and the difference between average load and peak load can be quite significant. In such cases, provisioning our system can be fairly hard and costly. This is where cloud computing and SLA-driven deployments can help us scale on demand and pay only for what we use.

Once we figured out a way to partition the application, it's going to be much simpler to package the application into self-sufficient units (referred to in GigaSpaces as processing-units) and scale the application simply by adding or removing these units on demand. You can learn more about this here

Final words

Scaling a real-time web application such as Twitter or Facebook introduces unique challenges that are are quite different from those of a "classic" database-centric application. The most profound difference is the fact that unlike with traditional sites, Twitter is a heavy read/write application, and not read-mostly. This seemingly minor difference can break most existing models for web application scalability. Using a combination of memcached + MySQL is not going to cut it for this type of application. 

The good news is that with the right patterns and set of tools, building a scalable architecture that meets such challenges isn’t that difficult.  There are already plenty of success stories that demonstrate that, such as the following example from highscalability.com: Handle 1 Billion Events Per Day Using a Memory Grid

The proposed architecture is by no means perfect and can be further optimized to meet even better performance and latency, but that will come at the cost of simplicity. I believe that the proposed architecture should get you pretty far as-is. Avoid going through more advanced optimizations until the point they are an absolute must.


References

January 22, 2009

Saving cost using Application/Middleware virtualization

Earlier this week, I gave a joint webinar with James Liddle, where we outlined
practical guidelines for saving costs using middleware and application-level Virtualization:

  • Saving the cost of peak/static provisioning using on-demand scaling
  • Saving the downtime cost
  • Saving costs through outsourcing part of our application and operations to the cloud
  • Saving costs using application level optimization (doing more with less)
  • Saving costs using platform consolidation to reduce the number of software components as well as utilize OpenSource and more commodity Software packages

Towards the end (Slide 20), Jim Liddle presented real life case studies from the iPhone launch in the UK and how some of hose principles have been applied to enable a successful launch in the UK.

Additionally, Jim went through some of the motivations and case studies that led different Telco, Online Gaming and Start-up companies to utilize our cloud, offering to gain better cost effectiveness.

For those who did not have the chance to participate in the webinar, we uploaded a recorded version of that presentation for you to view.





I would like to point out a few specific items from this presentation:

Beyond Server Side Consolidation (Slide 5)

Server-side-consalidation
Server-Side-Consolidation (SSC) played an important role in bringing the concept of virtualization to mass adoption. It forced organization to be able to map their application into concrete packages and look at machines as a logical entity rather then just a physical entity. Server-Side-Consolidation also brought a relatively simple model for cost saving: Instead of running applications on dedicated HW you can consolidate them into one machine. By doing so, you can reduce the number of servers and save the hardware and the operational costs associated with that.

Having said that, SSC is only one *very narrow* aspect of virtualization that unfortunately became too coupled with Virtualization.

The next step is obviously to move from SSC to application and middleware level Virtualization. Application Virtualization refers to the opposite scenario. Instead of putting multiple applications on the same hardware, we are taking a given application and spreading it on a pool of machines. This holds significant potential for making applications more efficient. For example, just think of the saving potential gained by moving applications from static peak-load provisioning to on-demand provisioning. In addition to that, we can utilize commodity hardware resources to get the power of high end machines.


Practical steps ( Slide 19)

One of the main concerns that most people have WRT to application level virtualization is the effort associated with applying those principles.

The table below captures the value vs effort that we can receive with each part of our middleware and application level virtualization.
Practical-steps 

As you can see, there are steps that require non or very little changes to our application:
For example, taking an existing web application that requires 10 machines to meet certain peak load, however on average it needs only 3 machines.  Normally we would statically provision 10 machines to meet the peak load which means that on average those machines would be poorly utilized. .  Instead, we can provision our web container on-demand and use Pay-Per-Use model to pay only what is consumed. This will enable us to save roughly 7 machines (10 - 3 ) and wouldn't require any changes to our code. (You can learn how you can do that with GigaSpaces here)

If we have a computational business logic or even a rendering application (a good example of that is  Slideshare or Yutube). Those type of applications tend to deal with fluctuating loads. So if on average we consume 10 machines and for peak we need up to 90 machines - you will be able to save ~80 machines! by provisioning the computational/rendering machines on-demand. (See here on how to use GigaSpaces for MapReduce computation and here to learn on how to use the Actor model)  

On the messaging and data side, we can reduce the amount of machines and the cost of scaling by partioning our messaging or data thus enabling linear scaling of those layers. In addition to that utilizing memory resources instead of files provide a significant boosst in performance. The combination of the two enables us to utilize commodity software and hardware resources instead of high-end resources - for example we don't need to rely on high-end databases, we can simply use MySQL.

Final Words

In this presentation we tried to gather most of the knowledge based on our past experience. Most of these lessons are generic and not necessarily specific to GigaSpaces.

I also tried to provide some practical implementation guidelines (Slide 19):

  • Avoid radical change, enabling a gradual process
  • Choose an architecture supporting linear scalability
  • Minimize vendor lock-in
    • Enable application portability and freedom of choice of:
      • Cloud provider, Web container, Programming language, Database
    • Minimize API lock in:
      • Use of standards
      • API Abstractions – when standards are not available
  • Future proof your application
    • Don’t make decisions today, but be ready to make one without major effort
    • Avoid long-term commitment – choose the right licensing model

Real life case studies:

I would like to encourage you again to listen to the case studies (Slide 20) and learn how others  apply some of those principles in their local-IT and Cloud.

Additional references:

December 09, 2008

Latency is Everywhere and it Costs You Sales - How to Crush it - My Take

Over on HighScalability.com Todd Hoff posted one of the comprehensive articles on latency that I've read titled Latency is Everywhere and it Costs You Sales - How to Crush it. It covers almost every aspect of latency, and is a must-read on the subject. Todd provides a good explanation of how Space-Based Architecture helps in reducing latency through collocation of tiers and by utilizing memory to remove the I/O bottleneck:

The thinking is that the primary source of latency in a system centers around accessing disk. So skip the disk and keep everything in memory. Very logical. As memory is an order of magnitude faster than disk it's hard to argue that latency in such a system wouldn't plummet.

Latency is minimized because objects are in kept memory and work requests are directed directly to the machine containing the already in-memory object. The object implements the request behavior on the same machine. There's no pulling data from a disk. There isn't even the hit of accessing a cache server. And since all other object requests are also served from in-memory objects we've minimized the Service Dependency Latency problem as well.

In this post I wanted to summarize my take-aways from Todd’s article and add some of my own thoughts based on my experience with GigaSpaces customers.

Sources for latency – is it the network or the software?

When discussing latency most people fall into one of two main camps: the "networking" camp and the "software architecture" camp. The former tends to think that the impact of software on latency is negligible, especially when it comes to Web applications.

Marc Abrams says "The bulk of this time is the round trip delay, and only a tiny portion is delay at the server. This implies that the bottleneck in accessing pages over the Internet is due to the Internet itself, and not the server speed."

The "software architecture" camp tends to believe that network latency is a given and there is little we can do about it. The bulk of latency that we can control lies within the software/application architecture. Dan Pritchett's Lessons for Managing Latency provides guidelines for an application architecture that addresses latency requirements using loosely-coupled components, asynchronousinterfaces, horizontal scale from the start, active/active architecture and by avoiding ACID and pessimistic transactions.

So is it the network or the software?

The simplest way to answer this question is to run a mockup test that removes the impact of the software on latency from the equation.

Global optimization vs Local optimization

To better understand latency optimization we can use the analogy of a plant production line. We have a big pipeline of things that need to get done and we need to look at each element in the pipeline to optimize our production latency. In an earlier posts - Moving to Extreme Transactions Processing using Lean methodology - I discuss how we can apply the same principles used in manufacturing line optimization, such as the Lean methodology, in software systems. I tried to illustrate the applicability of some of the core principles of lean methodology. Here’s a recap:

In many cases, we can get more bang for the buck by looking at an extended value-stream, as opposed to a localized one. Local optimization means digging into the latency path in a specific component in our system. With global optimization, however, we look at the entire pipeline and optimize at that level.

An example of local optimization would be looking at our messaging system and lowering the latency of sending a message from point A to point B. An example of global optimization would be looking at the end-to-end transaction. Processing a typical transaction involves sending messages through a messaging system, consuming it, and then updating the database. If we collocate the message queue with the data receiver, we can easily eliminate half of the network hops. Additionally, if we’ll use the same storage for messaging and data, we can avoid the 2-phase commit overhead. At the same time, we can analyze the user experience to see how many clicks it takes to perform a given operation. By reducing the number of clicks we can reduce the perceived latency much more then we can by reducing the time it takes to process each click. It’s easy to see how we can get better latency savings by taking the global optimization view. Global optimization often has much more room for optimization than a local one.

If our system is already designed with a scale-out model, adding more machines and spreading the load is much simpler than trying to apply local optimizations.

Scalability as a major source for latency.

One topic that is often missing or less understood in many latency discussions is the impact of scalability on latency.

Todd writes: "We put shards in parallel to increase capacity, but request latency through the system remains the same". This statement is a common fallacy. It assumes that each request is completely independent of the others. In reality, however, if the application is not designed with a scale-out/share-nothing approach then at some point it will hit a shared contention, which makes those supposedly parallel requests dependent on each another. Contention happens when multiple concurrent users or business-requests hit a shared resource at the same time. A shared resource might be a hardware resource -- such as CPU, memory or disk -- or a software resource, such as a shared database lock. Shared resources need to be freed before another request can be processed through them. This contention time is proportional to the number of concurrent attempts to consume the shared resource and the duration in which the resource is locked. This is one of the basic principles of Amdahl’s Law, which shows that to increase processing capacity of a request that spends 10% of its time on a shared lock will require a 100x increase to CPU power. This contention time, therefore, must be added to our “latency path”. In a non-scalable system this will be proportional to the number of concurrent requests, meaning it will rapidly lengthen as the system load increases, up to the point in which the system will face “resource starvation”. (See further discussion of this here ).

Based on my experience, hardware and software contentions are some of the main contributors to latency. This is partly due to the fact that network overhead latency is relatively fixed, while application overhead latency is variable. It is extremely complex to design a fully-optimized software application.

Scalability happens to be one of those things that are often implemented in a non-optimized manner, and as mentioned above, lead to latency. The only way we can reduce the scalability overhead on latency is by reducing the contention points in our application. The typical method for reducing the contention is through “sharding” (partitioning) the access to those resources using a share-nothing approach. Disks are less concurrent than memory, and therefore, removing the dependency on disk access in the critical path of the operation is one of the keys to a latency reduction strategy. This is one of the key principles of Space-Based Architecture:

The impact of peak load provisioning on the latency cost

Another source of latency is related to provisioning. Many web sites uses static provisioning based on peak load. But how do we measure peak load? With the introduction of social networks, and phenomenon such as the Digg Effect, it becomes extremely hard to predict peak loads, as user traffic is subject to “viral behavior” leading to sudden spikes in traffic. The further ahead we try to plan, we increase the chances of missing the target. This will lead to one of two outcomes: 1)Over-Provisioning – in which case latency is not harmed, but we unnecessarily pay the cost of more servers and other resources than we normally need. 2) Under-Provisioning -- in which case our site may significantly slow down or even crash.

Use on-demand scaling to smooth the latency peaks

If we can't predict the peak loads accurately, we need to scale the system rapidly whenever we see it is approaching capacity. If the system was not designed for scale-out (linear scalability), the process of scaling will involve a substantial amount of work and tuning, which is time-consuming and therefore defeats the purpose.

A scale-out approach enables us to scale on demand and smooth out the impact of load spikes on application latency by adding servers when the load is up and removing unnecessary ones whether load is reduced. In this way we can cost-effectively control latency.

Cloud computing and virtualization enable us to build such an “elastic computing” model with significantly less effort than previously necessary. For example, the GigaSpaces Cloud Framework already supports on-demand scalability for web containers.

My colleague Shay Hasidim posted a latency-benchmark that measured how low-latency is maintained by increasing the number of servers.

Web_bench2[1]

From the results above we see that as we increase the number of web servers system contention (scalability barrier) grows in terms of the number of concurrent users. With a single server, latency increases starting with 100 concurrent users; with two servers, at 300 concurrent users, and with three servers -- 500 concurrent users.

To ensure linear scalability on the web-tier, we must ensure that the underlying data-tier scales-out at the same level as can be seen in the diagram below. In this case we used GigaSpaces’ In-Memory Data Grid as a front-end to a MySQL.


Web_bench3[1]

With the graph above we see that the IMDG scales very close theoretical linear scalability. The above results were achieved with an IMDG running on 2 partitions. Better scalability can be achieved by increasing the number of partitions.

Read more about how to scale-out the data-tier in Scaling-out MySQL.

To enable this level of on-demand scalability we used our new Cloud Framework, which combines the GigaSpaces SLA-driven container as the application deployment virtualization layer, Amazon EC2 as the machine level virtualization layer, and the GigaSpaces application server as the middleware virtualization layer. This way we can provision new machines as soon as the SLA on the web-tier is breached (measuring latency, in this specific case). When such an event happens we launch new machine instances on EC2. A new web container is provisioned on these machines through the GigaSpaces SLA-driven deployment system. An apache load-balancer agent is responsible for synchronizing the load-balancer whenever a new web container joins the cluster. Using this approach we can achieve end-to-end dynamic scalability, starting from the load-balancer, through the web-tier and business-tier, and ending with the data-tier.

It is important to note that while this test was performed on EC2, there is nothing that bounds the solution specifically to the EC2 environment. In fact, we used the same exact model to enable dynamic scaling on private-clouds using GigaSpaces and the Sun Grid Engine, for example. A more detailed description of that is available here.

Data Query latency

Query latency is the time it takes to process a query request and receive the result. There are a few factors that influence query latency:

  1. The time it takes to access the data (read it from file in case it is stored on disk)
  2. Contention - the time spent on a shared lock to access the data
  3. Complexity of the query - the number of calls involved in executing the query

We can address each of those issues as follows:

  1. File systems are not optimized for concurrent access. In addition, file systems are stream-based systems that enforce serialization and de-serialization of the data every time we wish to access it. An easy way to eliminate this overhead is to put the data in memory, which enables access to it using a direct reference. (See further details in InfoQ Article - RAM is the new disk).
  2. We can reduce contention by partitioning the data, which also results in partitioning the lock. Putting the data in-memory also reduces contention because memory is much more concurrent than disk, and we don't need to inherit global file system locks. Instead, each data item can have its own locking. This will enable much more concurrent access to our data.
  3. Collocating the business logic with the data - We can reduce the number of remote calls required for each query using a stored procedure approach, meaning the business logic runs collocated with the data. For partitioned data we will need to use a MapReduce-like pattern to enable execution of the queries on distributed data sources. The fact that our data source is now partitioned enables us to reduce the time it takes to query compared with running the same query in a centralized database for the following reasons:
  • The data-set per partition is smaller; and

We can leverage the full capacity of the CPU/memory of each partition to get more power to process the query

Garbage collection impact on latency

Another source of latency that I found missing in Todd article is the impact of Garbage collection. Garbage collection is used in any Java or .Net application. Garbage collection runs as a background thread that cleans all the unused object in the JVM. In early versions of Java the Garbage collection implementation used a synchronized block on the entire memory during the time the garbage collection cleaned those unused objects. This hiccup time is dependent on the size of memory, number of CPU's and number of objects that are freed between each GC cycle. In those early versions of Java it was a common practice to use Object pooling as a way to reduce this hiccup. Object pooling basically bypassed the GC work and we had to take control over object lifecycle in our code. Having said that Object pools themselves became a shared resource and source for contention. As of Java 5 the GC algorithm was improved to enable more concurrent garbage collection. This means that the hiccup time was curved-out over the time therefore had less impact over our application peak performance. This holds true as long as we have enough CPU cycles to spend on the GC cycles. The caveat is that if our application consumes 100% of the CPU all this optimization is not going to help as when the GC hit our system it will compete with our application time and therefore the end result is long hiccups again. Real-time VM aims to address this problem by spreading CPU cycles between the application threads and GC threads in deterministic manner. I.e. it will slow down our application in some cases to ensure that GC gets enough cycles and in that case provide more predictable latency behavior on behalf of throughput.
The JVM comes with different switches that enable better control over the GC behavior and provides means to adjust the GC time. One thing to note about GC optimization is that it tends be close to a voodoo art. it works in certain scenarios and break in others so It is very hard to find the right combination.

Avoiding GC hiccups - Avoid over utilization

Based on my experience, the simplest and most effective way to avoid GC hiccups is to avoid over-utilization. This means that we need to plan our system in a way that wouldn't consume more then 80% of the CPU and memory under peak loads (you can choose a different threshold that may be more appropriate for your organization). For example, if the servers can process up to 500 requests per second at 100% utilization, and we have a requirement to process 1000 requests/sec, it is better to provision three machines, each processing roughly 330 request/sec, rather than two machines that are maxed out at 1000 request/sec. We also need to make sure that we have the right proportion between memory and CPU. For intensive read/write applications, I would go with at least 1CPU/2GB, and if possible, even 1CPU/1GB. These rules of thumb should get you most of what is needed in terms of latency. Obviously if that’s not enough, then you need to dig deeper into the GC flags or consider a Real-Time JVM, but you should use those options as a last resort.

A note 64 Bit VM provisioning:

Lately I've experienced some cases where people thought that they can use 64-bit machines and large memory heap sizes to reduce the cost of the system (mostly due to software license costs and machines maintenance fees). The assumption was that if with 64-bit each process can manage more memory, they can use fewer machines and fewer processers. What they didn't take into account are the considerations I presented above. The number of CPUs needs to be proportional to the amount of memory, and not just to the number of VM processes running the IMDG. This means that using 64-bit VMs can reduce the number of machines, but might have almost no impact on the number of CPUs the system will leverage. As for the amount of memory that each process can handle - that number tends to vary widely, so I don't feel comfortable giving a concrete number other than to say that I know of systems using the GigaSpaces product with 8GBs per process.

The cost of latency

Everything we do has a $ value associated with it. Latency is no different. Todd mentioned in his post some of the issues related to latency cost.

The cost associated with losing users due to a bad user experience – this measurement is typical for e-commerce, social networking and search engines sites: "Amazon found every 100ms of latency cost them 1% in sales. Google found an extra .5 seconds in search page generation time dropped traffic by 20%."

Another cost associated with losing trades – in this case the cost is a measure of the chance of losing business when your competitor can trade faster than you do: “A broker could lose $4 million in revenues per millisecond if his electronic trading platform is 5 milliseconds behind the competition.”

Another aspect that was not mentioned is the operation costs of achieving latency targets. This cost factor applies to latency in the same way it applies to other scalability operational costs.

The cost of over provisioning
– if the system was not designed for on-demand scaling then we are probably spending money on over-provisioning. Meaning the system is statically provisioned to have more machines than we actually need on average, and we pay the costs of under-utilization (idle resources waiting for peak loads).

The cost of failure – if the system was under-provisioned, then we are likely to face the cost of downtime. According to a Forester survey conducted with 235 organizations, 33% estimate the hourly cost of downtime at $10k-$100k , 25% at $100k-$500k, and 13% $500k-$1M.

How cloud computing can help to improve latency and save some of the latency cost

  1. Built for on-demand scalability – cloud computing is a great enabling infrastructure built for on-demand scaling.
  2. Geographically distributed – we can improve latency by running our servers close to the geographical location of the user. Quoting Todd’s article again: "Facebook opened a new datacenter on the east coast in order to save 70 milliseconds "

We can now have data centers spread around the globe at our disposal making it easy to run our applications in those different data center locations and point the user to the closest location at a fraction of the cost.

It is true that all this can be achieved without cloud computing. But cloud computing reduced the barrier to entry so that even the smallest startup can apply these optimizations, previously considered a luxury that only big companies could afford

My 20/80 rules for achieving predictable latency

I'm sure that many readers are aware of the fact that out of the many possible sources of latency, there are some that are beyond our control: Internet routers, for example.  One of the key questions I ask myself in relation to latency is whether there is a 20/80 rule.  What are the 20% of the things I should focus on that will help me reduce 80% of the latency. In this section I'll try to provide the guidelines I use for designing a system for optimum latency.

  1. Focus on application architecture and leave hardware and OS optimizations as a last resort. The performance provided by commodity hardware should be good enough for 80% of cases. In addition, the effort of optimizing hardware and Internet routers might involve a huge investment, and therefore, should be used sparingly. If you’re not sure whether the source of latency in your application is the network or the software, run the tests I mentioned above.
  2. Start with global optimizations – before you begin to optimize the database, the router and the messaging system, look at the entire pipeline of your business request. By looking at that global level, you may find that parallelizing some part of the request, or changing some of the reliability/consistency requirements, may have a much bigger impact than any local optimization.
  3. Use Spaces Based Architecture principles (even if you’re not using GigaSpaces) – quoting Todd again:
    1. Co-location of the tiers (logic, data, messaging, presentation) on the same physical machine (but with a shared-nothing architecture so that there is minimal communication between machines)
      1. Assemble/Collocate your application components based on the runtime/execution flow dependency and not based on their function in the system. For example if each request need to go through various steps such as parsing, validation, matching and execution it doesn't make sense to do each of those steps in separate process/tier. Instead you can make sure that all of those steps will be collocated and split the application into multiple units each containing all those various components. In SBA we refer to those units as processing units. This is probably one of the main difference between Space Based Architecture and Tier based architecture. In tier based approach our application is broken down into presentation tier, business logic and data-tier where in SBA we tend to collocate all of those tier as much as possible and split the application into multiple horizontal units each containing all the tiers to reduce the amount of moving parts and network hops.
    2. Co-location of services on the same machine
    3. Maintaining data in memory (caching)
    4. Asynch communication to a persistent store and across geographical locations 
      1. Avoid calling any disk/database operation at the critical path of the execution. With the addition of data-grid we can use in-memory data as the system of record. This enables us to avoid data or file access during the critical path of the user request and delegate the update to the data base as an asynchronous operation.
      1. You can add to Todd summary the other pieces associated with Query optimization such as the use of Map/Reduce and moving the logic to the data is located that I laid out above.
  1.  Design your system for dynamic scalability – Dynamic scalability doesn't necessarily means that scaling needs to happen in real time. It means that scaling can be done without changing code and the cycle of scaling is short. In many real-life scenarios “short” could mean a day or even a week.
  2. Provision correctly - Avoid over utilization.
  3. Other tips for optimizing the architecture:
    1. Decouple application components - Use SOA and EDA to make your application granular enough so that you can easily change the way you assemble the different components of your system without code changes. This flexibility is important as it will allow you to decide at different stages what your business logic pipeline is going to look like. It will allow you to optimize later, such as collocating elements that have strong dependencies among them from a business perspective. 
    2. Abstract your communication layer - Abstracting the network layer enables latency reduction when the components are collocated. This abstraction is also important to enable easy plug-ins of different transports without changing code. Assume that in the future new protocols, transports and other technologies will be introduced. By decoupling your code from the transport you can easily plug them in when they become available.


In most cases, following these steps gets you most of what you need. It also provides a good basis for eliminating many of the factors that make latency optimization on other layers more difficult. For example, if we run the business logic in a collocated in-process mode we isolate the impact on our code from external factors such as routers. It also provides a good model for troubleshooting and optimizing the system in case latency goes wrong. Rather than dealing with latency as a big-bang project, we should break it down into the levels that enable us to deal with the latency problem in a more gradual manner.

November 27, 2008

Data agregation pattern for effective monitoring

In my previous post I wrote about two patterns for using a GigaSpaces cluster to solve some of the issues involved in managing distributed applications:

  • Using the space as a scalable alternative to a directory service. With this approach each service publishes itself to the space directory service and a JMX proxy acts as a wrapper that virtualizes the different managed services from the client accessing them.
  • Using the space as a management repository aggregating management data.

Steve Colwill from PSJ published a second post on this topic titled Data Aggregation via JMX and the Grid that covers the second option outlined above.  In summary, Steve suggests that each managed agent will report its aggregate statistics to the space. A JMX façade is used to expose the statistics through a standard JMX API as described in the diagram below:

Agregating data2

If you think about it, this is yet another proof of the old aphorism by Butler Lampson that "All problems in computer science can be solved by another level of indirection".

The thing about indirection though is that it basically moves the problem from one layer (the application layer or management layer in our specific case) to another layer (infrastructure layer).  Now if that infrastructure layer is not in place, we have to create it ourselves, which makes Butler's statement kind of empty. This is where Space-Based Architecture becomes handy. It serves as a general purpose infrastructure layer that provide a means for solving data distribution, high availability and scalability. In this case we applied the abstraction principle as a way to expose the generic space capabilities through a specific set of APIs (JMX in this case). The combination of the two creates what I often refer to as a middleware virtualization layer. We use the same API, but the implementation of this API is virtualized. In this specific case, our JMX API doesn't point to a physical JMX server, instead it points to a virtualized cloud of servers.

Quoting Steve: 

One of the reasons I'm a fan of GigaSpaces and space-based architectures is that a number of architectural choices that are traditionally hard-wired: transactional/non-transactional, sync or async replication can be changed through configuration only. This enables common design patterns (and therefore components) to be applied to a wide range of application problems, by enabling the data integrity/performance equation to be tweaked at a late stage of application assembly.

 

Summary

The main benefit of this approach compared to the one described in the previous post is that it decouples the client and the managed services. A client doesn't need to maintain direct a connection to the managed service. Most of the logic is kept server-side. In this way we can keep the client-side --  which acts as the management façade -- stateless and thin.

Because the space acts as a distributed data-store, it is easier to build aggregated statistics, as all the statistics are placed in one logical entity - the space.

Decoupling enables us to easily add new services/policies to our system without changing the management service. For example, we can add SLA monitors that can listen to the statistics in the space and take action once a certain threshold is breached. This capability makes the space an ideal solution for those looking to build their next-generation management and monitoring application.

Cloud management frameworks (for Infrastructure-as-a-Service) are ideal candidates for this, as they have the need for proximity of the management information and application behavior. The management layer acts as a loopback mechanism that can tell the application how it is actually behaving. The application can use this information to adjust itself to meet a given SLA without human intervention.

Having said all that, it is also important to note that the two options presented in Steve's posts are not necessarily mutually exclusive. I could easily see how one could use the approach presented in this post for maintaining the managed dashboard information, and the approach in the previous post to invoke specific operations on a managed service, in case you want to drill down to the managed service level. Having one underlying technology that can be used to serve the invocation, virtualization and data virtualization is yet another benefit of using the space.

November 14, 2008

Private/Public Cloud

Most data centers of today run applications on dedicated machines. This is often referred to as static provisioning. In addition, applications are typically provisioned to handle expected peak loads. Both lead to over-provisioning and low resource utilization.

John Foley wrote an article back in August Private Clouds Take Shape in which he describes how data centers are reshaping themselves by taking ideas from public cloud providers, such as Amazon and Google.  The idea is to make the data center more cost-effective by enabling on-demand utility-based computing rather than dedicated machines.

The shift towards a utility data-center is a game changer. It will change the way IT operates, the way applications run in the data center as well as the culture of IT organizations.

The push to private clouds has a strong momentum these days, as all the major players, starting with the hardware vendors and ending with virtualization vendors, realize that their future rests in how well they fit in this model. Microsoft's Azure announcement one of the most significant announcements from a major vendor so far.

The need for Private/Public Clouds

At the same time, it is clear that to make IT operations more effective, it doesn't make sense to run all the applications that are currently hosted in a company's data center in the private cloud.  Not all applications in the data center are mission-critical or production systems. For example, take staging or testing environments. Such environments are supposed to be a mirror of the production environment. This is reasonable when our production system runs on a single dedicated server, but what if it runs on 10 or even 100 servers? Does it make sense to have another 10 or 100 dedicated servers just for that purpose? Another good example is disaster recovery. Disaster recovery sites require us to double our resources, let alone the cost associated with maintaining two separate data centers. These are classic scenarios in which running applications on a public cloud could lead to huge cost savings.

A recent InformationWeek survey (which Foley mentions in his piece) provides a more detailed view on the types of applications likely to move from private clouds to public clouds.

Upcomming cloud computng analysis (Information Week)

Making your application ready for the private/public cloud

The challenges

There are a few challenges to be aware of if we want as ready applications for a hybrid private/public cloud:

1. How do we design applications to be cloud-agnostic: how do we perform application testing on a public cloud and then run that exact same application in production on a private cloud. For the application to be cloud-agnostic we need to ensure that neither our application code or configuration is going to change by the transition and that our application is going to behave the same in both environments.

2. How do we enable seamless fail-over to a public cloud? To enable a disaster recovery scenario, the public and private clouds need to be connected in a way that enables seamless fail-over from the private to the public cloud

3. Future-proofing: There are many cases in which we can't make a clear decision as to where our application should be running at the time of writing or developing the application. We would like to be in a position to change the decision as to where our application will be running even after our application has been completely developed.

The solution

1. Enterprise-ready Platform-as-a-Service (PaaS)

Many recent discussions on cloud computing have been centered on the low-level infrastructure, such as virtualization. This is sometimes referred to as Infrastructure-as-a-Service (e.g., Amazon EC2). It is clear that to address the the first and the third challenges mentioned above we need a new middleware stack that will provide generic services for running applications in a virtualized environment, or a platform-as-a-service.

Pass You can read more about this layer in GigaSpaces as Alternative to Google AppEngine for the Enterprise. The role of the enterprise-ready PaaS will be similar in nature to that of the application server of today, only that it will broaden to support the needs of private cloud environments, as I outlined in my earlier post.

2. Cohesive FT's VPN Cubed

While entperise-ready PaaS shields application code from the underlying cloud infrastructure, CohesiveFT's VPN Cubed is responsible for connecting one or more cloud networks through a secured channel in a way that makes them all appear as one big cloud, even if they are not owned by the same provider.

See for yourself how it works live!

My colleague Dekel Tankel  blogged about the joint solution by GigaSpaces and CohesiveFT aimed at addressing these challenges. The solution will be presented in a webinar next week Making Cloud Portability a Practical Reality.

We will show how you can write and deploy standard applications on top of GigaSpaces' Cloud Framework and use CFT's VPN Cubed for a seamless transition across clouds. Even more interesting is that using this solution you can even use multicast discovery across clouds.

By "standard application" I mean that you can deploy a standard JEE web application packaged as a WAR and deploy it in the multi-cloud environment. It doesn't need to be a "GigaSpaces application". In the webinar next week we will show live how we can deploy an application across both Amazon EC2 and Flexiscale, kill one of the machines and see how the application fails-over seamlessly between the clouds with zero downtime.

The webinar will take place next week: November 18, 2008, 1:00 PM EST.

You can register online here




November 10, 2008

The impact of cloud computing on build vs. buy behaviour

Last week I took part in an interesting discussion with a group of architects, and the question of build vs. buy came up. It came up specifically in the context of the recent experience with many of new Internet companies. I was wondering why it is that that many of them seem to spend so much in developing their own proprietary infrastructure, when it's clear that their needs are not that unique and that such development is not really part of their core IP. Many of them seem to continuously go through difficult experiences until they get their infrastructure right. And it seems that they all stumble into the same pitfalls along the way.

The typical answers as to why they build vs. buy were:

  • It's core to our intellectual property and therefore we have to own all of our infrastructure
  • We didn't find a solution that fits our needs since our needs are very unique
  • We had a bad experience with Product FooBar which made us reevaluate build vs. buy

I can see how I'd react in exactly the same way; the most basic human instinct, when entering uncharted territory, is to rely only on yourself.

But looking at the amount of repeated failure over the past few years, it's pretty clear that this pattern isn't really proving itself too well either. Even when we choose to build it ourselves, according to our own specific in-house requirements, we still end up falling into the same trap over and over again.

Where to draw the line of build vs. buy?

To answer that question, I looked at Fred Brooks's article "No Silver Bullet" which was pointed out to me again by one of our lead architects few weeks ago.

One of the interesting points was the drastic impact of the economy on the build vs. buy decision pattern:

"The development of the mass market is, I believe, the most profound long-run trend in software engineering. The cost of software has always been development cost, not replication cost. Sharing that cost among even a few users radically cuts the per-user cost. Another way of looking at it is that the use of N copies of a software system effectively multiplies the productivity of its developers by N. That is an enhancement of the productivity of the discipline and of the nation.

The key issue, of course, is applicability. Can I use an available off-the-shelf package to perform my task? A surprising thing has happened here. During the 1950's and 1960's, study after study showed that users would not use off-the-shelf packages for payroll, inventory control, accounts receivable, and so on. The requirements were too specialized, the case-to-case variation too high. During the 1980's, we find such packages in high demand and widespread use. 

What has changed? Not the packages, really. They may be somewhat more generalized and somewhat more customizable than before, but not much. Not the applications, either. If anything, the business and scientific needs of today are more diverse and complicated than those of 20 years ago.

The big change has been in the hardware/software cost ratio. In 1960, the buyer of a two-million dollar machine would have felt that he could afford $250,000 more for a customized payroll program, one that slipped easily and nondisruptively into the computer-hostile social environment. Today, the buyer of a $50,000 office machine cannot conceivably afford a customized payroll program, so he adapts the payroll procedure to the packages available. Computers are now so commonplace, if not yet so beloved, that the adaptations are accepted as a matter of course."

The impact of cloud computing on the buy vs. build decision

I think Fred's analysis above is much more than just a historic curiosity. Exactly the same process is playing out today, with the advent of cloud computing and virtualization techniques that are turning IT infrastructure into a commodity, on the road to becoming a utility, and dramatically reducing its total cost. 

As Fred says in his paper - when the hardware gets cheap, development becomes very expensive. Under these new conditions, we're all going to have to change how we evaluate off-the-shelf products compared to the alternative of developing in-house. Proper TCO measurements need to be put in place at an early stage of the decision making process. 

For example, it will no longer be sufficient to choose a product based on the "best performance" or even "best reliability," because each of those factors has a direct cost associated with it. Instead, we are forced to have a better picture of the business requirements, so that we can choose the right product to meet our business needs, and it's not always going to be that the best product from a technical perspective is the right product - and the cheapest product won't be the right product either.

"The hardest single part of building a software system is deciding precisely what to build. No other part of the conceptual work is as difficult as establishing the detailed technical requirements, including all the interfaces to people, to machines, and to other software systems. No other part of the work so cripples the resulting system if done wrong. No other part is more difficult to rectify later."

It is quite surprising to see how much of the current decision-making process is not based on real business requirements. It is even more surprising to see how little we as architects and business people know about their system requirements and real application behavior.

A good example that was given in the architect meeting is the user experience. One participant in the discussion said that at one point, he was focusing on making the latency of serving his site pages as fast as possible and did a good job at that, but at the end of the day, when measured against a competing site that was performing slower, the impression of the user was that the competing site was performing better - the reason was simple, the other site was focused on user experience which led to less clicks per request and not how much time a single request is being executed.

If using off-the-shelf products can cut costs dramatically, why are there are so many product failures?

Fred provide an interesting answer to that question as well:

"Much of present-day software-acquisition procedure rests upon the assumption that one can specify a satisfactory system in advance, get bids for its construction, have it built, and install it. I think this assumption is fundamentally wrong, and that many software-acquisition problems spring from that fallacy. Hence, they cannot be fixed without fundamental revision--revision that provides for iterative development and specification of prototypes and products."

Final words

You might be thinking by now that these are all new lessons learned from the recent changes in the economy, right? – wrong. Go check when Fred Brooks' article was written.

If anything, I would strongly recommend that everyone reading this post would spend time reading Fred's article from start to finish, because I've only covered a small part of the philosophy behind his paper. I think the paper's viewpoint is extremely relevant today -- perhaps even more relevant today then it was when he originally wrote it.


November 05, 2008

Managing application on the cloud using a JMX Fabric

One of the challenges of managing application in a distributed environment such as Cloud/Grid is that collecting or finding the management information of each part of the application is a relatively complex task.

JMX provides a standard way to expose the management information (MBean) of a particular server. However, the way the client-side finds all the MBeans that comprise the application, or the way a single client might interact with the distributed parts of the application, is left open.

Steve Colwill from PSJ wrote a detailed blog, JMX for Grid Based Applications,
where he outlines a solution that uses JMX JSR-160 connectors and GigaSpaces to create a JMX Fabric. According to the proposed solution, the managed agent (server side) use the connector to add a reference of each MBean stub to the space. The client uses a FederatedMBeanServerConnection class that picks up those references from the space, connects to them and then delegates operations to the set of Mbean servers, effectively acting as a multiplexer.


Federated-jmx2 Using the space as a JMX directory service

The above diagram illustrates how the model described by Steve works. The client is abstracted from the physical location of each server and can easily discover services that join the network. The connection from the client to the servers uses peer-to-peer communication, which means that once the service is discovered, no additional overhead is needed for communication between the client and the managed service. In this case, the space is used as a directory service. We leverage the fact that it can be distributed and dynamically discovered to simplify the discovery process in a distributed environment.

Using the space as a management data repository

The above model is quite useful for cases in which we want to expose federated services which have an existing remote interface. But this is not always the case. If it isn't, we can use the space as a management data repository, which contains full management information for each agent and exposes that information to the client or to any management application. In this method too, the client application is abstracted from the managed service. But unlike the first option, the client gets the information about the managed entity directly from the space, and doesn't need to maintain a connection with the managed service. The space in this case is used as a distributed database, so the application can not only obtain management information about an individual server but can also gather aggregate statistics and perform other aggregate data queries, directly on the data model.

Summary

Steve's solution to managing application in a distributed environment is an interesting one, as it enables applications that are already using a standard JMX interface to use a new federated model without changing the application and without adding a performance overhead. This is achieved just by plugging a new space-based connector. It is a good example that shows how a space can be used as a distributed directory service. It is important to note this is only one pattern in which a space can be used to solve this type of challenge. There are other ways; using the space as a management data repository, as I suggested in this post, is just one of them. The nice thing is that implementing any of these patterns becomes fairly simple once the space is brought into the picture.

I would like to end this post by thanking Steve specifically and PSJ in general for being a great partner for such a long period of time, and for sharing your experience in such meticulous detail.

October 04, 2008

Is MapReduce going mainstream?

I'm getting a lot of questions lately about the use of MapReduce: how it compares with other technologies such as Grid, and how the the different solutions that claims support for MapReduce (GigaSpaces included) fit into the puzzle. A good starting point is the intense discussion on the cloud computing mailing list under the topic: "Is Map/Reduce going mainstream?" where I contributed some of my own thoughts on the topic.

To summarize the questions on this topic, I'd state it as follows:

  1. How can we reduce the barrier-to-entry for implementing MapRreduce specifically, and parallel processing applications in general?

  2. Many of the use cases for MapReduce represent some sort of data analytic application. But can MapReduce be used as a generic parallel processing mechanism? Specifically, is it suited to deal with issues such as data affinity, asynchronous batch processing, etc.?

In this post I'll try to answer these questions, but first, a few clarifications:

What is MapReduce?

Quoting from the Wikipedia definition

MapReduce is a software framework introduced by Google to support parallel computations over large (multiple petabyte[1]) data sets on clusters of computers.

Why do we need a new model for processing large data sets?

Unlike central data-sources, such as a database, you can't assume that all the data resides in one central place, and therefore, you can't just execute a query and expect to get the result as a synchronous operation. Instead, you need to execute the query on each data-source, gather the results and perform a 2nd-level aggregation of all the results. To speed the time it takes to run this entire process, the query needs to be done in parallel on each data source. The process of mapping a request from the originator to the data source is called "Map"; and the process of aggregating the results into a consolidated result is called "Reduce".

MapReduce implementations

Hadoop is the most well-known MapReduce implementation.

Hadoop is an open source project that implements the exact spec defined by Google in Java. As such, it was designed primarily to enable MapReduce operations on distributed file systems and was not really designed as a general purpose parallel processing mechanism.

The wikipedia entry on MapReuce (http://en.wikipedia.org/wiki/MapReduce) has references to other implementations in other languages, including Greenplum, Skynet, and Disco.

Other forms of MapReduce implementations

Over time, the term MapReduce has expanded in definition to describe a more general purpose pattern for executing parallel aggregation of distributed data-sources, rather than referring to a specific type of implementation. GigaSpaces, GridGain, and to a degree, Terracotta, all took a different approach than Hadoop in their MapReduce implementations. Rather than implementing the exact Google spec in Java, these three aimed to take advantage of the Java language and make the implementation of the MapReduce pattern simpler to the average programmer (I'll get back to that later).


How MapReduce differs from other grid implementations?

Compute Grid

While MapReduce represents one form of parallel processing for aggregating data from distributed data sets, it is not the only one. "Compute Grid" is a term used to define another form of parallel processing, used mostly to compute intensive batch processing. A typical batch processing takes a long-running Job, breaks it into small tasks and enable the execution of those tasks in parallel to reduce the time it takes to execute the job (Compared with the time it would have taken to execute the tasks sequentially). This model is a good fit for executing relatively compute-intensive and stateless jobs. A typical scenario for this would be a Monte Carlo simulation, such as the one used to perform risk analysis reports in the financial industry. This type of analysis is more compute-intensive than data-intensive. Most compute-grid implementations have the following components:

  1. Scheduler

  2. Job executor

  3. Compute agent

The executor submits jobs. The scheduler is responsible for taking the job, splitting it into a set of small tasks (this process requires specific application code) that are sent in parallel based on a certain policy to a set of compute nodes. The agents on each compute node execute those tasks. The results of those tasks are aggregated back to the scheduler.

The scheduler is responsible for monitoring and ensuring the execution of the tasks. The scheduler was designed to support advanced execution policies, such as priority-based execution as well as advanced life-cycle management.

Master/Worker Pattern (simple Compute Grid)

The Master/Worker pattern is a simplified version of parallel batch execution, based on the Tuple Space model. Tuple Spaces emerged from the Linda project at Yale university. JavaSpaces is the main Java implementation of the model. A good description of this model is provided in this article. In a master/worker pattern, tasks are assumed to be evenly distributed across worker machines. In this case there is no need for an intermediate scheduler. Load-balancing is achieved through a polling mechanism. Each worker polls the tasks and executes them when it's ready. If a worker is busy, it simply won't process the tasks, and if it is free it will poll the pending tasks and process them. Consequently, Workers running on a more powerful machine will process more tasks over time. In this way, load balancing is implicit, supporting simple task distribution models. For this reason, master/worker implementations tend to be more useful for simple compute-grid applications.The fact that there is no need for an explicit scheduler makes master/worker more performant and better suited for cases where latency is an important factor.

MapReduce & Compute Grid: Summary

Although both Map/Reduce and Compute Grids provide a parallel processing model for solving a large- scale problems, they are each optimized for addressing a different kind of problem. MapReduce was designed to address shortening the time it takes to process complex data-anlytics scenarios. The results of the processing need to be returned in real-time, as the originator of the task normally blocks until its completion. Compute Grid applications are aimed at speeding-up the time it takes to process complex computetional tasks. the Job is executed as a background process that can often run for a few hours. Users don't typically wait for the results of these tasks, but are either notified or poll for the results. With MapReduce, the application tends to be data-intesive, therefore scalability is driven mostly by the ability to scale the data through paritioning. Executing the tasks close to the data becomes critical in this scenario. Compute Grid applications tend to be stateless, and normally operate on relatively small data-sets (compared with those of MapReduce). Consequently, data affinity is considered an optimization rather than a necessity.

When to use MapReduce, Compute Grids and Master/Worker?

  • If you need to agregate data that resides in a distributed file system then I would recommend the use of Hadoop and the like.

  • If you need to agregate data that resides in other data sources, such as an in-memory data-grid (IMDG), you should consider GigaSpaces, or a combination of compute grid and data grid products.

  • If your application is compute-intensive and relatively stateless in nature â€" you should consider the classic Compute Grid implementations.

  • If you're looking for a real-time (or near-real-time) and lightweight compute-intensive application, you should consider Master/Worker implementations

In reality, most compute-intensive application are not purely stateless. To execute the tasks the compute tasks need to process data that is coming from either a database or a file system. In small scale applications, it is common practice to distribute the data with the job itself. In large scale compute-grid applications, however, passing the data with the job can be quite inefficient. In such cases, it is recommended to use a combination of Compute and Data Grid. In this case, the data is stored in a shared data-grid cluster and passed by reference to the compute task. So we see the need for a combination of Compute and Data Grids becoming more common.

Too many options? Feeling confused?

At this point you may be scratching your head wondering whether or not your application falls precisely in any of the above categories.

A quick reality check will reveal that many existing applications consist of a variety of the above scenarios, mixed with traditional client-server models.

In such cases, attempting to use a different product for each scenario in our application is going to make things extremely complex.

How do we make distributed programming like MapReduce simple?

This question has been the driving force for many of our recent development efforts.

To simplify things, we realized that we need to:

  1. Grid enable existing programming models -- Use abstraction and virtualization techniques to introduce parallel processing as part of a normal client/server programming model.

  2. Reduce the amount of frameworks -- Provide a common model for using both parallel computing models: batch (compute-intensive) and real-time aggregation (data-intensive).
  3. Make data-awareness implicit with all APIs -- In reality, most application are stateful to some degree, so we need to make data-awareness implicit within our API and not as an afterthought. External integration solutions tend lead to complexity.

Where does GigaSpaces fit in?

GigaSpaces emerged from the tuple space model, specifically JavaSpaces, and was one of the first implementations of the Master/Worker pattern. At a later stage, we extended our JavaSpaces implementation to a full IMDG (In-Memory Data-Grid). In large scale compute grid applications, the GigaSpaces Data-Grid is often used in conjunction with other Compute-Grid implementations, either commercial or open source. This puts GigaSpaeces in a unique place, providing data-grid and data-aware compute grid capabilities using the same architecture. We also provide built-in integration of our Data-Grid with more advanced Enterprise Compute Grid products, such as those from DataSynapse and Platform Computing.

As of version 6.0, we offered abstraction layers (referred to as the Service Virtualization Framework or SVF) that take advantage of our existing space-based implementation in a way that doesn't require a complete re-write or a steep learning curve for developers who have already written their business logic as SessionBeans, Spring Remoting, RMI, CORBA, SOAP and other common Client/Server programming models. Our aim was to make distributed programming simple to the average programmer. We achieved this goal by following the same principles that I laid out above. For example, we introduced a set of abstractions on top of our space-based implementation. As we support both data distribution and task distribution, we are able to reduce the number of required frameworks and runtime components, as well as avoid the need for external services to ensure data affinity. In addition, we extended our support for aggregated MapReduce queries using a new Executor framework. With this we can support MapReduce and batch processing using the Master/Worker pattern and the *same* consolidated programming model.

The idea behind all this is to make scale-out development simple by making the API as close as possible to prevailing programming models, and by reducing the number of products and components required to scale either data-intensive or compute-intensive applications.

Final notes

The emergence of MapReduce specifically, and Grid computing in general, creates a need for another type of programming model currently missing in most existing mainstream frameworks and products. So far the solution has been to provide different specialized frameworks to to address each need. The fact that we have so many different frameworks (MapReduce included) makes things more complex.

On the Cloud Computing mailing list, Chris K Wensel wrote the following comment:

'thinking' in MapReduce sucks. If you've ever read "How to write parallel programs" by Nicholas Carriero and David Gelernter (http://www.lindaspaces.com/book/), many of their thought experiments and examples are based on a house building analogy. That is, how would you build a house in X model or Y model. These examples work because the models they present are straightforward.......If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher level abstraction that users and developers can reason in.

Indeed Making MapReduce part of mainstream development requires a higher level abstraction. The high level abstraction needs to provide means to use existing programming models on top of MapReduce to shorten the learning curve and transition from existing applications to distributed scale-out applications. Having said that, this is not enough, as we're still going to end up with multiple frameworks for addressing various parallel programming models that are not covered with MapReduce, such as Compute Grids and batch processing. It is therefore critical to map those different models into a coherent and consistent model that would support all various programming semantics, including MapReduce, Master/Worker and batch processing, in addition to the classic Client/Server model, with the ability to smoothly transition among them, without the need to switch or integrate different frameworks for each, and without the need to write our business logic in a completely differently way for each.

July 29, 2008

Can scaling be made seamless?

Putting together the two words "seamless scaling" in front of a technical audience is a very dangerous thing to do. The technically savvy folks are walking around with plenty of scars from previous attempts to scale their system - enough to know that there "scaling" and "seamless" couldn't be further apart. But nevertheless, in this post I'm going to take the risk and do just that :)

Basically what I'm going to try and argue is that while scaling can't be made seamless across the board, there are different techniques to make scaling seamless in certain scenarios, or at least very close to seamless. I will use GigaSpaces as an example of how to achieve seamless migration of existing JEE applications into a scale-out model, with zero or minimal change to the code. I'll also outline our general principles, which I believe are applicable to any application seeking seamless scaling.

The seamless scaling dogma

There has been a lot of discussion over the past year about different patterns of scalability. I devoted quite a few of my posts on this topic. Most of them centered around architecture - how we can use partitioning to avoid a data bottleneck, how we can use in-memory implementations to get better performance and concurrency compared to implementations based on the file-system, and how we can use an asynchronous event-driven architecture as a better way to scale our business logic.

Randy Shoup outlined these principles nicely in his infoQ article, Scalability Best Practices: Lessons from eBay. The dogma behind all these discussions and panels was that scaling requires a very rare set of skills, which average developers don't have, and that's why we're still seeing plenty of online system failure. The most recent was the iPhone launch failure.

Does scaling really have to be complex?

Well, if you look at Network Attached Storage as an example, you'll see there are alternatives to the traditional dogma around scaling. With storage systems, we don't really think of scaling that much. More so - our applications don't really need to be aware of the fact that they run over a local disk or a network -attached device. We can scale by adding disks, even hot-swapping them in some cases, even while our application is still running.

Now imagine what the world would look like if it wasn't that simple. If our application would need to be aware of what's behind the scenes of these storage devices and would have to be re-written to deal with these scaling issues. It's not that hard to imagine, is it? Most likely we would still have been talking about storage-related system failure as a result of bad architecture and implementations issues. But we don't have anything to talk about, because storage gave us a level of abstraction that enabled almost everyone, regardless of their skill-set, to deal with scaling without being an expert at it, or really even thinking about it much at all.

Can we learn any lessons from NAS about our ability to achieve seamless scalability?

Let's see what were the conditions that made seamless scaling with storage possible:

  • Well-defined interface (or abstraction)
  • Interface that fits the share-nothing approach to make it suitable for scaling
  • Simple interface
  • Widely-used interface

Now if we examine these principles as they apply to other layers of the application stack, we'll get a decent answer as to why we haven't been able to apply the same level of seamless scaling - which storage already provides - to these other layers.

In the data layer, the most commonly -used interface is SQL. SQL fit well with criteria (1) and (4) criteria but doesn't meet (2) and (3). HashTable fit well with (2) and (3) but unfortunately is less commonly used in distributed systems. JavaSpaces, like HashTable, fits (2) and (3) but is even less commonly used then HashTable. In the messaging tier, JMS fits well with (1), (3) and (4) but doesn't lend itself well to (2), and so on. And these are the cases where there is a well-defined standard. Unfortunately, in other layers of our applications it's even harder to find a well-defined standard that fits to all of these criteria.

To overcome this complexity, there have been other attempts to use the JVM bytecode as a lowest common denominator and introduce seamless scaling not at the middleware API level, but on the JVM level using bytecode manipulation. This seems like an elegant solution to the problem, however most of the existing distributed systems were not written as a standalone Java applications that get distributed by some sort of magic, so it fails mainly on the 4th criteria - it fits mainly to new applications that were designed with certain assumptions in mind about how the standalone Java code would behave in a distributed environment.

Now to the point - can we scale seamlessly?

Those who expect a simple yes-or-no answer to this question are going to be disappointed - there is no clear answer , because it depends on the specific application scenario, the way the application was written and the maturity of various standards around these applications.

In general I would say that Java-framework-based applications are in better condition then applications based on other frameworks, due to the maturity of the standards and the advanced layer of abstractions that are now available as part of framework such as Spring and Mule.

Seamless scaling at the application layer would most likely mean the ability to plug-in different underlying scalable implementations at the middleware layer (data, messaging, business-logic, presentation). The use of abstraction layers such as IOC in Spring/Mule and the new EJB3 abstraction gives more freedom to plug in different implementations that don't necessarily conform to the exact same standard API. That means that your code can remain intact when you plug in a different messaging implementation, for example, whether it is a JMS implementation, a space-based messaging, or remoting.

Some cases are going to be easier then others. For example, taking a SessionBean and scaling it by having multiple instances of that service running over a pool of machines, while viewing them all as if they where a single server, can be done through configuration changes only. We can do pretty much the same thing to the messaging layer, where we will have a virtual queue and topic rather then a centralized server.

On the data layer things are more tricky, as most of the commonly-used standards in this area don't fit criteria (2) very well. If our data model is built with a complex object graph, or if our queries depends on complex joins, then we're not going to be able to scale it out without changes to the code or to the domain model. But even in these more difficult cases, it's possible to minimize the scope of change by using the DAO pattern, declarative transactions and annotations as a mapping layer on top of the domain model. This means that even if the change can't be completely seamless, it will nevertheless be quite simple to achieve.

Learning from the GigaSpaces experience

At this point I'd like to use our specific experience at GigaSpaces to describe the methods we used to enable seamless scaling:

  • Use standard APIs, but only when it makes sense. For years we chose not to implement large parts of the JEE standard, such as EJB and Entity beans, because they didn't fit the scale-out environment and were too bounded to database. What I'm trying to say is that implementing a standard API is not always going to make the transition to scale-out model seamless, so you should be careful which standard you pick.
  • Leverage existing abstractions to plug in different implementations that are based on other APIs or technologies than the one originally used. We use this principle quite extensively in our OpenSpaces framework, to map our own transaction handlers, Remoting abstraction, to enable seamless scaling of SessionBeans , etc.
  • Use annotations for mapping between different models.
  • Use aspects to add new behavior when it makes sense. We use aspects in several cases such as filters/remoting aspects and security aspects. We will probably be using aspects more to address a more advanced level of serialization.
  • Apply more tightly coupled integration to specific products/frameworks  A good example for that is our Spring, Mule and upcoming web tier integration. This sort of integration enables an end-to-end seamless scaling story that makes the user experience significantly better. On the .Net side our integration with Office and Excel enables something equivalent.
  • Use open source as a tool to open up the framework for extensions and other integration work. This is something that we introduced quite recently through our new OpenSpaces.org community site and found it to be a useful tool with many extensions already available. GigaSpaces users implemented their own extensions and made them available through the community site. The most recent one has been Camel integration.

Real life examples

Of course, this isn't just a theoretical discussion - we've been attempting to achieve this level of seamless scaling in practice since we introduced our middleware virtualization stack, which was our first attempt to address scaling of existing applications and not just new applications.

We have been involved in numerous scenarios of scaling out existing applications. An interesting example is detailed in Mickey's recent blog post, in which he describes in more detail how he was able to scale-out a JBoss/Oracle RAC-based application. Mickey provides a good description with code snippets that show the before and after effects, both in terms of code changes and obviously scaling and performance. You can find the details of that experience here. The bottom line of this case study is the fact that he was able to get that application from 15tx/sec to 1500tx/sec in less then 4 days! For me, measuring the time it takes to move your EXISTING application and see the immediate results is the ultimate measure. You have to agree that if the transition to a scale-out model wasn't seamless, it wouldn't have been possible to do in such a short time, and more importantly, without ripping and replacing the entire application. In Mickey's case, we started with decoupling of the database to get the initial scaling, and replaced the other layers incrementally.

Summary

Storage taught us the lesson of seamless scaling. Seamless scaling can be achieved on other layers of our application as well, using a combination of Standard APIs, Abstractions, Aspects and tailored integration. In most cases, seamless scaling would mean no changes to our application code but would require changes to configuration and packaging. Not all layers can make a fully seamless transition. But in those more difficult cases, we can use the same principles to significantly minimize the changes required for scaling.

In this post i wanted to share some of our GigaSpaces experience in that area as i believe many of the lessons and principles are pretty generic and can be applied to any project/product. At this point it is also important to note that this is not a one-off proposition. It's a continuous effort and requires a long-term roadmap and commitment. We've been struggling with this for years and applied every possible method to achieve this goal. Some required significant re-factoring of our entire infrtustructure. The lastest one has been the addition of our OpenSpaces framework as an open source development framework based on Spring. With this change, we can easily support more APIs and frameworks, as well as build an entire ecosystem around it that will enable others to apply the same model to even more frameworks and applications very easily.

You may wonder why we, as a commercial company, would want to do this - after all it also means that GigaSpaces can be replaced much more easily. Well, the reason is fairly simple - we believe that our success and adoption will be much higher if we can get to the point where scaling any application through GigaSpaces won't require any changes to code. It took few years and an intensive effort to get a point were I can feel comfertable to use the two words "Seamless Scaling". Now we're starting to see the fruits of that effort - just see the recent post by Seon Lee who appears to be one of the Mule users: Mule 2.0 + GigaSpaces 6.5 = Pure Sex:

Gigaspaces released 6.5 with API integration with Mule 2.0 … this is just plain awesome. You can use Gigaspaces as the transport (e.g. in place of JMS) and quickly get a SBA up and running utilizing the same concepts I used at RHG when we were servicing B2B problems. You also get the advantage of the clustering ability and fault tolerance that comes with Gigaspaces – which is just pure sex – not to mention all the other great features that come with this advanced Javaspaces implementation (i.e. management tools, monitoring tools, data partitioning, performance features like batching).

I expect to see even more on that line with our latest 6.6 release which includes Seamless Scaling of Web application - check that out!

July 21, 2008

GigaSpaces is Available on Apache Camel

Apache Camel is a Spring based integration framework.
I was happy to see that David Greco released a JavaSpace connector for Camel based on GigaSpaces.

Quoting from David description of the connector:

"The javaspace: component is a transport for working with any JavaSpace compliant implementation, this component has been tested with both the Blitz implementation and the GigaSpace implementation .
This component can be used for sending and receiving any object inheriting from the Jini Entry class, it's also possible to pass an id (Spring Bean) of a template that can be used for reading/taking the entries from the space.
This component can be also used for sending/receiving any serializable object acting as a sort of generic transport. The JavaSpace component contains a special optimization for dealing with the BeanExchange. It can be used, then, for invoking remotely a POJO using as a transport a JavaSpace.
This latter feature can be used for an easy implementation of the master/worker pattern where a POJO provides the business logic for the worker.
Look at the test cases for seeing the various usage option for this component."

Interestingly enough I'm seeing more the use of space based transport used to drive this new type scale-out integration frameworks. Beyond the space transport i believe that Camel users can leverage the fact that they can use the space as a data-store for sharing the state between the various services without needing to go to database just for that purpose.

Nice work David!





My Photo

Twitter Updates

    follow me on Twitter