In part-I of this post I tried to suggest a scalable architecture that can deal all three properties of CAP. Part II goes back to the definition of Availability and Partition tolerance and the different views on that regard. In this post I would like to elaborate more specifically on how the suggested architecture from part-I works in the context of all three CAP properties.
Recap from previous posts
According to the CAP theorem, one should choose two out of the three CAP properties: consistency (C), availability (A), and partition tolerance (P). Some suggest to choose AP and compromise on consistency. Others like Stonebreaker suggest CA as a better set of tradeoffs:
There is enough redundancy engineered into today’s WANs that a partition is quite rare. My experience is that local failures and application errors are way more likely. Moreover, the most likely WAN failure is to separate a small portion of the network from the majority. In this case, the majority can continue with straightforward algorithms, and only the small portion must block. Hence, it seems unwise to give up consistency all the time in exchange for availability of a small subset of the nodes in a fairly rare scenario
My suggestion followed a similar line of thought as presented by Dr. Stonebreaker, with the addition that I argued that we don’t necessarily need to give away partition tolerance completely when we choose CA.
I suggested an alternative approach where, instead of looking at each of the CAP properties in absolute terms and selecting only two out of the three, we could use a more relaxed version where we could apply various degrees of all three and compromise on the degree in which we apply each property based on the business requirements of our applications.
We will address the most likelihood scenario in which we expect to experience failure or partition tolerance and compromise in areas which are less likely to happen.
In Part I, I used one of the common GigaSpaces clustering topologies as a reference to that model. In the next section I’ll explain how that specific topology applies to all three CAP properties.
GigaSpaces clustering explained..
The following diagram taken from part-I of this series of posts describes a clustering topology that was designed to deal with scalability, throughput as well as consistency, availability and partition-tolerance.
How it works (Quoting from the original post…)
..we break our data into partitions to handle our write scaling between nodes. To achieve high throughput we use in-memory storage instead of disk. As in-memory devices tend to be significantly faster and concurrent than disk and since network speed is no longer a bottleneck we can achieve high throughput and low latency even when we use synchronous write to the replica.
The only place in which we’ll use asynchronous write is the write to the long-term-storage (disk). As the user transaction doesn’t access the long-term storage directly through the read or write path, they are not exposed to the potential inconsistency between the memory storage and the long-term storage. The long-term storage can be any of the disk based alternatives starting from a standard SQL databases ending with any of the existing
Now let’s examine how we can apply the various CAP properties into that architecture.
- Consistency under concurrent updates. To ensure consistency in the case of concurrent updates on the same data record each individual record is mapped to a single logical-partition at each given point in time. To ensure scalability, different records of the same logical table are written to multiple partitions in parallel as described in the diagram above. Each partition support the various locking semantics (pessimistic, optimistic (versioning), dirty-read) etc.. to control the concurrent access of the same record within the context of a single partition.
- Consistency between two or more replicas. To ensure the continuous high availability we keep one or more copies of our data. In asynchronous replication, we may end up with scenarios where read and write operations would hit two different nodes at the same time and end up reading two separate versions of that same data. There are various algorithms that were developed to handle that situation. In our case we chose to avoid getting into that situation in the first place through the use of synchronous backup. The performance overhead of the synchronous replication is fixed and is not proportional to the size of the cluster (each partition replicates data only to its backup replica). The replication to the database is kept asynchronous to reduce the overhead of writing to disk.
- Transaction consistency. Single operations or groups of operations can be executed under transactions. This ensures the ACID properties. Transactions can be made local to each partition. In this case, they will bound to the scope of a single partition and would be highly optimized in terms of performance. Transaction can also span between nodes (in this case the overhead is obviously going to be higher).
- Ordering – All operations are ordered based on the time they were written. This is specifically relevant to ensure the consistency between the in-memory cluster and the long-term storage which is being updated asynchronously.
- Primary failure. We always keep one or more replica nodes as a hot backup. The hot backup will take over immediately in case a primary node fails. The hot backup nodes uses synchronous replication to ensure no data loss before fail-over took place.
- Backup failure. When a backup node fails, the primary continues to serve requests and log the operation in a redo-log. In parallel, a new backup is being provisioned on demand to take over from the failed one. That process involves a provisioning process (in which a new backup is created) and a recovery process (during which the backup gets its state).
- Failure of multiple nodes. To increase availability, some of the NoSQL variants suggest at least three or more replicas per partition. In this way, we can handle simultaneous failure of multiple nodes. That obviously comes with a huge overhead – for each terabyte of data, we would have two terabytes of redundant information for backup purposes, and - equally important - we would also have the consistent overhead of keeping all of them up to date. An alternative approach is to use on-demand backups. On-demand backups are provisioned automatically as soon as one of the nodes fail. If spare capacity exists within the current pool of machines the backup will be provisioned into an alternate machine within the existing pool – this process can take a few seconds (depending on the amount of data per partition). If no machine is available, it will start a completely new machine.A new backup will be providioned into that new machine. The process of starting a new machine with its backup can take few minutes. As soon the node starts it will first use a primary election protocol to find the master node within its group and only then it boots up. The startup process include a recovery stage in which the node recovers its state from either the master node or the available replicas. The source node will also store all the updates since the recovery started in a redo log and would replay all updates to fill in the gap since the node started its recovery process.
- Client failure. Clients use a cluster-aware proxy to communicate with the cluster. The smart proxy ensures that a write or read operation is always routed to one of the available partitions. The routing happens implicitly thus the client is not exposed to a fail-over scenario.
- Discovery protocol. The GigaSpaces cluster discovery mechanism is based on the Jini specification. Services use the discovery protocol to find nodes within the cluster and share cluster state amongst all nodes.
3. Partition Tolerance
- Network partition between primary and backup. When the connection between two nodes fails, the primary node logs all the transactions into a fifo queue known as the redo-log. As soon as the communication gets re-established all the data gets replayed to the backup. If the backup fails completely, the system will start a new instance as described above.
- Network partition with the long term persistency. If the communication with the long-term persistency datastore fails, the replica will log all the operations till the connection gets re-established. The log is also replicated to a backup node to ensure that the data won’t be lost in case the primary partition fails before the data was successfully committed to the long-term storage.
- Network partition between two or more data center sites. Most people referred to scenarios where two sites can live in two separate locations and continue to work independently in a case of network partition as THE reference scenario for partition tolerance. It is important to note that there are two class of multi-site deployments that are fundamentally different as it relates to the network partition:
- Disaster recovery site – is often located in close geographical proximity to the primary site and is backed by high bandwidth and redundant network.
- Geographically distributed sites over internet WAN – in this case we have multiple sites that are spread over the globe. Unlike disaster recovery sites those sites tend to operate under lower SLA’s and significantly higher latency.
The recommended approach for dealing with multisite network partition would be fairly different in each of the categories that are mentioned above:
Disaster recover site. Disaster recovery sites are very much like any node in a local network but often live in different network segments and with higher latency than local networks. Nodes within a cluster can be tagged with a zones tag to mark their data-center affinity. The system can use this information to automatically provision primary and backup nodes between the two sites. It will use the zone tag to ensure that primary and backup are always spread between two data centers.
- Consistency & Availability - The consistency and availability mode would be just the same as LAN based deployment as described above but the performance and throughput per partition would be lower due to the higher latency associated with the synchronous replication.
- Partition tolerance – The system will continue to function even in a case where entire site fails or become disconnected. The system will continue to work through the available site. The system will also rebalance itself as soon as the communication between both sites gets re-established in order that the load will be evenly distributed between the sites.
Geographically distributed sites over internet WAN. In this scenario, nodes are spread over internet connections where the SLAs are lower and latency is significantly higher. In that case, it would be impractical to treat all of the nodes as a single cluster as in the previous case. It would be more practical to use a federated cluster deployment (a cluster of clusters) where we will use asynchronous replication to synchronize the multiple sites and therefore avoid the extreme latency overhead. In this mode, we can’t achieve all three CAP properties. In most cases it would be more common to choose AP over CA. Having said that even when we choose AP we don’t necessarily need to give away consistency completely. Here is a suggested architecture that can provides reasonable degree of consistency with slight compromise on absolute availability and partition tolerance:
- Consistency: Each site is the sole owner of its data – i.e. all updates on the data that belong to this site needs to be delegated to this site. In more extreme scenario we can define a master site which will own all the updates to the data by all other satellite sites. In that case we can ensure consistency and read availability over write availability.
- Partition tolerance: All sites use asynchronous replication to maintain local copies of the entire data set. In a case of a network partition between the sites each site can continue to read/write its own data even when other sites are not available. It can also read other sites data but will not be able to change any data that is owned by other sites. In that context we give away some level of partition tolerance for consistency.
- Availability: Local failures are dealt with through the same model as described above. Updates on data that belong to other sites will be blocked. In that context we give away some degree of availability for consistency.
Note – Another option to deal with consistency under concurrent updates between two sites can be based on versioning. In that case it is assumed that the latest version wins or the system will delegate the decision for resolution on conflicting updates to the application. This is a more complex scenario that goes beyond the scope of this post.
Handling Extreme Failure:
Multiple node failures at the same time. In cases where both the primary and backup of a given partition fails at the same time, we will consider this cluster as non functional and block operations to that cluster until the cluster becomes functional again. In that case we will compromise on certain aspects of Availability for the sake of Consistency. If your application can tolerate the potential consistency issues associated with such failure, you could configure the cluster to route all operations to currently available partitions. In that case we will trade Availability over Consistency – this behavior can be useful in streaming scenarios where the system is used to pass through events.
Note that if we choose to use a proper disaster recovery site setup where we ensure that primaries and their backups never run on the same site the chances for having such a failure is close to zero in the first place and can be considered equal for having the two sites down at the same time.
Recovery from total failure. When the entire system crashes, it will boot itself from the long-term persistent storage or from a snapshot (new). In that case we may lose some of the data that was kept in memory and wasn’t yet delivered to the long term persistent storage. It is important to note that in a disaster recovery setup that event can happen only when both sites goes down at the same time. In that case the system would recover itself from the data that was last updated in the long-term storage.
In many of the recent discussions on the design of large scale systems (a.k.a. Web Scale) it was argued that the right set of tradeoffs for building large scale systems would be to give away Consistency for Availability and Partition tolerance. Those arguments relied on the foundation of the CAP theorem developed in early 2000-2002. One of the core principals behind the CAP theorem is that you must choose two out of the three CAP properties. In many of the transactional systems giving away consistency is either impossible or yields a huge complexity in the design of those systems. In this series of posts, I've tried to suggest a different set of tradeoffs in which we could achieve scalability without compromising on consistency. I also argued that rather than choosing only two out of the three CAP properties we could choose various degrees of all three. The degrees would be determined by the most likely availability and partition tolerance scenarios in our specific application. The suggested model was based on the experience we had in GigaSpaces over the course of the past years and was successfully deployed in many mission critical systems today in Finance, Telco and ecommerce business. I hope that through the sharing of this experience we could come up with a broader set of patterns on how to build large scale systems that would fit also to mission critical transactional systems.