Guy Nirpaz, Uri Cohen and Shay Banon came up with an interesting exercise as part of the recent partner training that took place at the GigaSpaces office. In this exercise, the students were asked to come up with a scalable design for Twitter, using Space-Based Architecture.
There are some interesting scalability lessons from this exercise, which are applicable to anyone looking to implement new-style real-time web applications such as the ones used for social networking.
In this post I'll try to summarize the main patterns to put into place and considerations to make when designing such a scalable architecture.
For those of you who are not yet familiar with the service, Twitter is sort of an SMS service meets discussion board. You can post short messages (up to 140 characters) that are shared with a group of subscribers referred to as "followers". The main difference between Twitter and other messaging applications is that both SMS and Instant Messaging (IM) applications were designed primarily for one-on-one communications, whereas Twitter was designed primarily for broadcast communications (publish/subscribe, or pub/sub). Another aspect that is special about Twitter is that by default anyone can follow anyone else. In other words, it was designed for open communications, not private ones, as IM and SMS were.
What are Twitter's scalability challenges?
1. Sending a tweet (a message on Twitter is known as a 'tweet') – The challenge is how to handle an ever-growing volume of tweets, re-tweets and responses that can lead to a viral "message storm".
2. Reading tweets – The challenge is how to handle a large number of concurrent users that continually “listen” for tweets from users (or topics) they follow.
Designing A Scalable Twitter
Choosing the right scalability patterns
Almost every challenge in software architecture has its roots in one of the existing patterns. So the simplest course is to start by looking for those patterns, and choosing the right ones to scale the application. As in many other scalable architectures, we'll begin with a partitioning pattern as the core design principle. By partitioning our Twitter-like application we'll spread the load across a cluster of servers and scale by simply adding more servers (i.e., partitions). Another important architectural observation about Twitter is that it doesn’t fit into the classic database-centric design that most web applications do. On the flip side, it doesn’t fit well with a messaging-centric design (pub/sub) either. It is a combination of the two.
A pattern that is suitable for this type of collaborative messaging is known as a blackboard pattern. In our design, we will use those two design patterns -- partitioning and blackboard -- as the foundation for our scalable Twitter application. With the foundation in place, let’s list the requirements and examine how these patterns can be used to scale the app.
We'll assume a relatively extreme scaling requirement:
- Tweet Volume: 10 billion tweets per day
- Tweet Storage: 100 Gigabytes per day (with 10:1 compression)
- Tweets are limited to 140 characters
- Tweets are immutable, i.e., there are no updates, only inserts
- Twitter limits client applications to 70 requests per hour
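To get a feel for what these requirements mean in practice, the figures above can be turned into rough rates. The following is only a back-of-envelope sketch: the class and method names are hypothetical, and it assumes one byte per character and that every tweet uses the full 140 characters, which gives an upper bound rather than a measured value.

```java
// Back-of-envelope load figures derived from the stated requirements.
// Assumptions (not from the original exercise): 1 byte per character,
// every tweet at full length.
final class TwitterLoadEstimate {
    static final long TWEETS_PER_DAY = 10_000_000_000L; // 10 billion tweets per day
    static final int MAX_TWEET_BYTES = 140;             // 140 characters, 1 byte each
    static final int COMPRESSION_RATIO = 10;            // 10:1 compression
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Average write rate; peaks during a "message storm" will be higher. */
    static long averageTweetsPerSecond() {
        return TWEETS_PER_DAY / SECONDS_PER_DAY;
    }

    /** Upper bound on the compressed payload per day, in bytes. */
    static long compressedBytesPerDay() {
        return TWEETS_PER_DAY * MAX_TWEET_BYTES / COMPRESSION_RATIO;
    }

    public static void main(String[] args) {
        System.out.println("avg tweets/sec: " + averageTweetsPerSecond());
        System.out.println("compressed GB/day (upper bound): "
                + compressedBytesPerDay() / 1_000_000_000L);
    }
}
```

The upper bound works out to 140 GB per day, so the 100 GB figure corresponds to an average tweet of roughly 100 bytes. The average write rate is well over 100,000 tweets per second, which is why write scalability matters as much as read scalability in this design.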
Now that we have the foundational patterns and clear requirements, we can design the architecture. We'll start first with the blackboard system.
Using an In-Memory Data Grid (IMDG) as a Blackboard System
There are several approaches to building a blackboard system. To maximize performance and scalability, we'll store the data in memory, thus avoiding disk I/O, which is often the main cause of contention. For years, Java has provided a model for designing blackboard systems known as JavaSpaces. More recently, distributed caching has become popular and can provide similar capabilities to those of JavaSpaces. Let's examine two popular distributed caching approaches for our blackboard system:
- Simple read-mostly caching using memcached
- Read/write caching, also known as an In-Memory Data Grid (IMDG)
Choosing between memcached and an IMDG
Memcached enables us to store the data (tweets) in a distributed memory set and read it in a scalable fashion. Having said that, be aware that memcached is not transactionally safe and is not designed for reliability (i.e., it doesn’t support fail-over and high availability). That means that if we use memcached or something similar, we will have to use a database as the back-end. Every tweet posted will have to be written to both memcached and the database in a synchronous fashion to ensure that no tweet is lost. This approach may be good enough for scaling read access; for writes and updates, however, it offers limited scalability.
Unlike memcached, which was designed for simple read-mostly caching, In-Memory Data Grids are designed for handling a read/write scenario, and can therefore act as the system-of-record for both write and read operations. We can still use a database for long-term persistence, but because the IMDG maintains its reliability purely in memory, we can write and update the database asynchronously and avoid hitting the database bottleneck.
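The asynchronous (write-behind) idea can be illustrated with a small, self-contained sketch. This is not the GigaSpaces API: WriteBehindStore and its methods are hypothetical names, the "database" is just an in-memory list standing in for a real one, and the in-memory replication that gives a real IMDG its reliability is left out entirely.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative write-behind store; all names are hypothetical.
final class WriteBehindStore {
    private final Map<String, String> memory = new ConcurrentHashMap<>();  // system of record
    private final List<String> database = new CopyOnWriteArrayList<>();    // stand-in for a real database
    private final ExecutorService persister = Executors.newSingleThreadExecutor();

    /** Returns as soon as the tweet is in memory; persistence happens in the background. */
    void write(String tweetId, String text) {
        memory.put(tweetId, text);
        persister.submit(() -> database.add(tweetId + ":" + text));
    }

    /** Reads are served from memory only. */
    String read(String tweetId) {
        return memory.get(tweetId);
    }

    /** Drains pending writes, then reports how many rows reached the "database". */
    int flushAndCountPersisted() {
        persister.shutdown();
        try {
            persister.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return database.size();
    }

    public static void main(String[] args) {
        WriteBehindStore store = new WriteBehindStore();
        store.write("t1", "hello");
        System.out.println(store.read("t1"));
        System.out.println(store.flushAndCountPersisted());
    }
}
```

The point of the sketch is that the caller of write() never waits for the database. With memcached, by contrast, the database write would have to sit on the critical path.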
Todd provides a clear explanation of how an IMDG works (using GigaSpaces):
- A POJO (Plain Old Java Object) is written through a proxy using a hash-based data routing mechanism to be stored in a partition on a Processing Unit. Attributes of the object are used as a key. This is straightforward hash based partitioning like you would use with memcached.
- You are operating through GigaSpace's framework/container so they can automatically handle things like messaging, sending change events, replication, failover, master-worker pattern, map-reduce, transactions, parallel processing, parallel query processing, and write-behind to databases.
- Scaling is accomplished by dividing your objects into more partitions and assigning the partitions to Processing Unit instances which run on nodes-- a scale-out strategy. Objects are kept in RAM and the objects contain both state and behavior. A Service Grid component supports the dynamic creation and termination of Processing Units.
Back to our Twitter app: Given the scalability requirements, we will need to scale both reads and writes, and therefore, an IMDG is a more suitable approach to implementing the blackboard system.
Now let’s examine how the use of an IMDG as the blackboard system enables us to scale both sending and reading tweets. Let's start by designing the partitioned cluster.
Designing a partition architecture
One of the main considerations in designing a partitioned cluster of any kind is determining the partition key, such as a Customer ID in a CRM application or a Trade ID in a trading application. At first glance, it sounds like a trivial decision, but choosing the right partitioning key requires a deep understanding of the application usage patterns and data model. In the case of Twitter, we could choose to partition the application by the data-type, the user, the tweet itself or the followers. Our first goal is selecting a key that will be granular enough to enable scaling the application just by adding more partitions, while making sure that we don't end up with a key that is too fine-grained -- making it sub-optimal for querying purposes.
If we use the timestamp as the key, for example, our application will be optimized for “inserts” (writes). However, even a simple query such as “retrieve the tweets of a certain user” will force us to execute an aggregated query against all partitions. Alternatively, if we partition the data based on user-id, we'll be able to easily spread the load from different users across partitions, and retrieving the tweets of a certain user is resolved in one call to a single partition. We may encounter a problem if a single user generates a significantly higher load than average; however, in the case of Twitter, we can assume that this is not very likely. Partitioning by user-id is a good compromise.
Data capacity analysis
With such extreme requirements it is clear that storing all tweets in memory is going to require huge memory capacity. Very quickly this will become economically prohibitive, so we need to devise a scheme in which the IMDG acts as a buffer for most of the load on the system, and then offloads the data and queries to an underlying persistent storage. In our Twitter example, it is fair to assume that most real-time queries (those that require fast access to the data) will be resolved in data from the last hour or 24 hours. Queries that require older data will need to hit the database for the initial call. However, subsequent access to fetch new updates should be resolved purely in-memory.
Using this approach, we'll need about 10 servers, each holding 10GB of data in memory to accommodate 24 hours of activity. If we also want to back up the data in memory, we will need double the amount of servers.
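The server count follows directly from the storage requirement. A minimal sketch of the arithmetic, with hypothetical class and method names:

```java
// Sizing arithmetic for the in-memory tier; names are illustrative only.
final class GridCapacity {
    /**
     * Servers needed to hold dataGb in memory, rounding up.
     * The count is doubled when every partition also has an in-memory backup.
     */
    static long serversNeeded(long dataGb, long memoryPerServerGb, boolean withBackup) {
        long primaries = (dataGb + memoryPerServerGb - 1) / memoryPerServerGb; // ceiling division
        return withBackup ? primaries * 2 : primaries;
    }

    public static void main(String[] args) {
        // 100 GB per day (from the requirements), 10 GB of tweet data per server.
        System.out.println(serversNeeded(100, 10, false));
        System.out.println(serversNeeded(100, 10, true));
    }
}
```

With 100 GB per day and 10 GB per server this gives 10 primaries, or 20 servers once backups are included.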
Choosing the right eviction policy
It's reasonable to assume that recent data is accessed most and older data is rarely used. To ensure that we get the maximum hit ratio on our memory front-end, let's choose a time-based eviction policy, which always holds the most recent updates in memory. When we reach our memory capacity limit, the oldest data is automatically evicted from memory. The actual window of time that we can keep in memory obviously depends on the size of the cluster. With an IMDG implementation, all tweets are also stored in persistent storage, which means that tweets that are evicted are not deleted from the system.
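The eviction behavior can be sketched as a bounded buffer that drops its oldest entry once capacity is reached. This is an illustration of the policy only, not how GigaSpaces implements eviction: RecentTweetBuffer is a hypothetical name, capacity is counted in tweets rather than bytes, and persistence is done synchronously into a plain list for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Oldest-first eviction over an in-memory window; names are illustrative.
final class RecentTweetBuffer {
    private final int capacity;
    private final Deque<String> window = new ArrayDeque<>();   // oldest at head, newest at tail
    private final List<String> persistent = new ArrayList<>(); // stand-in for the backing store

    RecentTweetBuffer(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    void add(String tweet) {
        persistent.add(tweet);      // every tweet is persisted, so eviction never loses data
        window.addLast(tweet);
        while (window.size() > capacity) {
            window.removeFirst();   // evict the oldest tweet from memory only
        }
    }

    /** Tweets currently held in memory, oldest first. */
    List<String> recent() {
        return new ArrayList<>(window);
    }

    int persistedCount() {
        return persistent.size();
    }

    public static void main(String[] args) {
        RecentTweetBuffer buffer = new RecentTweetBuffer(2);
        buffer.add("a");
        buffer.add("b");
        buffer.add("c");
        System.out.println(buffer.recent());
        System.out.println(buffer.persistedCount());
    }
}
```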
Scaling tweet writes:
If we select user-id as the partitioning key, each user tweet will be sent to a specific partition. Multiple users may be routed to the same partition. Usually the algorithm to determine which partition fits a certain user is something like:
routing-key.hashCode() % #of partitions
In GigaSpaces, this is done by marking the routing attribute of our tweet class with an @SpaceRouting annotation.
The web front-end application will call space.write( new Tweet(..),..) to send the tweets. This way there is nothing in our web client code that exposes the fact that the underlying implementation interacts with a cluster of partitions (spaces in GigaSpaces). Those details are abstracted within the space proxy. When the write method is called on the space proxy it parses the field that matches @SpaceRouting from our Tweet() object and uses this field value to calculate the partition it belongs to. It then uses that value to route the Tweet(..) object to the appropriate partition.
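The routing calculation itself is simple enough to sketch in plain Java. This is a minimal illustration and not the GigaSpaces implementation: PartitionRouter and partitionFor are hypothetical names, and Math.floorMod is used in place of a bare % because hashCode() can return a negative number, which would otherwise produce a negative partition index.

```java
// Hash-based routing from a routing key (e.g. user-id) to a partition index.
// Names are illustrative; the real space proxy hides this step from the caller.
final class PartitionRouter {
    private final int partitionCount;

    PartitionRouter(int partitionCount) {
        if (partitionCount <= 0) throw new IllegalArgumentException("partitionCount must be positive");
        this.partitionCount = partitionCount;
    }

    /** Same key always maps to the same partition; result is in [0, partitionCount). */
    int partitionFor(Object routingKey) {
        return Math.floorMod(routingKey.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(4);
        System.out.println(router.partitionFor("alice"));
        System.out.println(router.partitionFor(7));
    }
}
```

Because the mapping depends on the number of partitions, changing that number moves keys between partitions. This is one reason to pick the partition count with future growth in mind.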
With this approach, the web application can be written in a very simple way and can interact with the entire cluster as if it was a single server.
The data from the memory partitions gets stored asynchronously into a persistent storage. The persistent storage could be a database, but it could also be other things, such as an index search engine based on Compass/Lucene.
Scaling tweet reads:
To those familiar with messaging systems, at first glance Twitter looks like a classic publish/subscribe application. A closer look, however, reveals that any attempt to implement Twitter with something like a JMS message queue is going to fail to achieve a scalable system. This is especially true if you consider that the system needs to maintain a durable queue for each user. That could easily lead to a scenario in which each tweet is published to thousands of subscribers and every re-tweet can potentially lead to a "message storm".
As discussed above, the right way to think about this type of application is as a blackboard pattern, just as a blackboard (or these days, a whiteboard) is used by a group of people (followers, in the case of Twitter) to share information and collaborate. When someone writes something on the board, everyone sees it and can choose to react. Unlike messaging (take email for example), we don’t need to send separate messages to each subscriber. Instead everyone is looking at the same board. Everything is also copied from the board to paper. When the board runs out of space, we erase it. And we can always page through the paper copy to access the board history.
In Twitter, this means that each follower that follows a group of people is basically polling for messages posted by those users from the last time he read them. To make things more tangible we can express this type of query with the following SQL syntax:
SELECT * FROM Post WHERE UserID=<id> AND PostedOn > <from date>.
The <from date> will normally be the last few minutes, if we're constantly looking for new messages.
But there's a caveat. Remember that we partitioned the application by user-id? This means that the tweets of the users we follow are likely to be spread across different partitions. How can we read all of those users' posts? If we poll for each user individually, we will end up with a lot of network calls. The simplest approach is to execute one call that looks for updates (new tweets) from ALL the users we're following. The pattern we'll use to perform this task is map/reduce. One way to do that with GigaSpaces is through the distributed task API.
The distributed task API is a modern version of the stored procedure. The following snippet shows what such a call would look like:
AsyncFuture<Long> future = gigaSpace.execute(new GetTweetsUpdates());
long result = future.get(); // blocks until the reduced result from all partitions is available
The GetTweetsUpdates() class contains code that will be injected into each partition and will enable us to look for updates from the users we follow in a single call. Because the call runs in-process, and because the data is stored in-memory, executing such a task is extremely fast compared with the equivalent database stored-procedure call. The results are returned to the caller implicitly, and the caller can use a reducer to aggregate them into a single result object.
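To make the map and reduce steps concrete, here is a self-contained sketch that mimics the shape of such a task in plain Java. It does not use the GigaSpaces API: TweetUpdatesQuery, Tweet, mapOnPartition and execute are hypothetical names, each partition is just an in-memory list, and a thread pool stands in for the remote partitions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical scatter/gather sketch of "get updates from the users I follow".
final class TweetUpdatesQuery {
    static final class Tweet {
        final String userId;
        final long postedOn;
        final String text;

        Tweet(String userId, long postedOn, String text) {
            this.userId = userId;
            this.postedOn = postedOn;
            this.text = text;
        }
    }

    /** "Map" step: runs next to one partition's data and filters it locally. */
    static List<Tweet> mapOnPartition(List<Tweet> partition, Set<String> followed, long since) {
        List<Tweet> hits = new ArrayList<>();
        for (Tweet t : partition) {
            if (followed.contains(t.userId) && t.postedOn > since) {
                hits.add(t);
            }
        }
        return hits;
    }

    /** Scatters the map step to every partition, then "reduces" by merging and sorting by time. */
    static List<Tweet> execute(List<List<Tweet>> partitions, Set<String> followed, long since) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<List<Tweet>>> futures = new ArrayList<>();
            for (List<Tweet> partition : partitions) {
                futures.add(pool.submit(() -> mapOnPartition(partition, followed, since)));
            }
            List<Tweet> merged = new ArrayList<>();
            for (Future<List<Tweet>> future : futures) {
                merged.addAll(future.get()); // wait for each partition's partial result
            }
            merged.sort(Comparator.comparingLong(t -> t.postedOn));
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition query failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The filter inside mapOnPartition is the in-memory equivalent of the SQL query shown earlier, applied to a whole set of followed users at once. The caller makes a single logical call no matter how many users it follows.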
Scaling the web front-end
Nothing really new here. We'll use a classic web front-end, which is composed of a load-balancer and a cluster of web servers that act as a front end to our IMDG instances. The web application will use a single cluster-aware IMDG proxy to send new tweet posts. The IMDG proxy will be responsible for mapping each tweet to the actual partition that hosts it. That logic is kept completely out of the application code. This allows us to keep our web tier clean and simple.
Keeping the web layer stateless to avoid session stickiness
One common pattern for keeping the web tier scalable is to use a Shared-Nothing Architecture, which basically means that the web tier will be stateless. This requires keeping the user session state external to the web tier. The IMDG can be used as a high-performance, scalable data store for maintaining this shared session state. This allows us to avoid session stickiness and to scale the web tier without a user being locked in to a specific server for the entire session, even when that server is overloaded.
For more information on how to scale the web tier, as well as other important capabilities such as self-healing and auto-scaling, see the following tutorial: Scaling Your Web Application.
Making it simple and cost-effective using cloud computing
Twitter is yet another example of a situation in which system load is highly variable and the difference between average load and peak load can be quite significant. In such cases, provisioning our system can be fairly hard and costly. This is where cloud computing and SLA-driven deployments can help us scale on demand and pay only for what we use.
Once we have figured out a way to partition the application, it is much simpler to package the application into self-sufficient units (referred to in GigaSpaces as processing-units) and scale the application simply by adding or removing these units on demand. You can learn more about this here
Scaling a real-time web application such as Twitter or Facebook introduces unique challenges that are quite different from those of a "classic" database-centric application. The most profound difference is that, unlike traditional sites, Twitter is a heavy read/write application, not a read-mostly one. This seemingly minor difference can break most existing models for web application scalability. Using a combination of memcached + MySQL is not going to cut it for this type of application.
The good news is that with the right patterns and set of tools, building a scalable architecture that meets such challenges isn’t that difficult. There are already plenty of success stories that demonstrate that, such as the following example from highscalability.com: Handle 1 Billion Events Per Day Using a Memory Grid
The proposed architecture is by no means perfect and can be further optimized for better throughput and latency, but that will come at the cost of simplicity. I believe that the proposed architecture should get you pretty far as-is. Avoid more advanced optimizations until they become an absolute must.
- Scaling twitter presentation by Guy Nirpaz