Learning from the experience of others has always been a great source for many of the posts in this blog.
I thought that the information that Ron gathered for this purpose could be extremely beneficial for everyone who is either running, or plans to run their application in the cloud.
In this post, I have tried to capture, in words, the content of Ron's presentation on the lessons from recent cloud outages.
Recap - History of Cloud Outages
- 21 April 2011 - Some parts of Amazon Web Services suffered a major outage. A portion of volumes utilizing the Elastic Block Store (EBS) service became "stuck" and were unable to fulfill read/write requests. It took at least two days for service to be fully restored. Reddit, one of the better-known sites to go down due to the error, said it had 700 EBS volumes with Amazon. Sites like Quora and Reddit were able to come back online in "read-only" mode, but users couldn't post new content for many hours.
- 29 June 2012 - Several websites that rely on Amazon Web Services were taken offline due to a severe storm of historic proportions in the Northern Virginia area where Amazon's largest data center is located.
- 22 October 2012 - A major outage occurred, affecting many sites, again such as Reddit, Foursquare, Pinterest, and others. The cause was a latent bug in an operational data collection agent.
- Christmas Eve 2012 - Amazon AWS again suffered an outage, causing websites such as Netflix instant video to be unavailable for some customers, particularly in the Northeastern US. Amazon later issued a statement detailing the issues with the Elastic Load Balancing service that led up to the outage.
Cloud outages are not the sole property of Amazon – they’re everywhere
While most of the more notable failure events happened to be related to Amazon AWS, failures tend to be in direct proportion to the usage of the infrastructure, and right now Amazon is probably running the biggest workloads on its cloud, and is growing fast. Given that, it's very likely that Amazon's failures are more notable than others' simply because they have a wider impact.
As we have experienced, failure has paid a visit to other cloud infrastructure providers as well.
Microsoft Azure outages
- 29 February 2012 - The ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland; Chicago; and San Antonio.
- 26 July 2012 - Service for Microsoft's Windows Azure Europe region went down for more than two hours.
- 28 December 2012 - Some owners of Microsoft's Xbox 360 gaming console were unable to access some of their cloud-based storage files.
Main takeaway
Looking at all of these failures, it becomes apparent that they don't quite follow a common pattern.
Failure tends to happen when and where you least expect it.
Rather than relying on the infrastructure to prevent such failures from happening, we need to learn how to cope with them as a way of life.
What does 99% availability mean anyway?
Quite often when we talk about availability, we’re referring to % of uptime.
In this context, for most people 99% uptime sounds good enough. Is it?
Let's examine what that means in days:
- 99% - 3.65 days downtime
- 99.9% - 8.76 hours downtime
- 99.99% - 53 minutes downtime
- 99.999% - 5.26 minutes downtime
99% uptime means that we need to be ready to tolerate almost 4 days of downtime, and no one can ensure how those 4 days will be spread across the year.
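The figures above can be derived directly from the availability percentage; a quick sketch:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year

def yearly_downtime_hours(availability_percent):
    """Return the maximum yearly downtime (in hours) for a given uptime %."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100.0)

for nines in (99.0, 99.9, 99.99, 99.999):
    hours = yearly_downtime_hours(nines)
    print(f"{nines}% uptime -> up to {hours:.2f} hours of downtime per year")
```

For example, 99% uptime allows 87.6 hours (3.65 days) of downtime per year, which matches the first row of the list above.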
The impact of cloud outages
Although AWS went offline for only a few hours, the downtime did have an impact on customers' businesses. There is no reliable data on the number of people affected by a cloud computing service outage. It is estimated that the travel service provider Amadeus loses $89,000 per hour during a cloud computing outage, while PayPal loses around $225,000 per hour.
How to survive cloud outages - (Lessons from RightScale & Netflix)
Good sources for learning how to survive cloud outages are Netflix and RightScale, both of which have a good track record of surviving many of the previously mentioned cloud outages.
Below is a summary of the main takeaways.
- Make sure to have a dedicated expert to manage your disaster recovery (DR) architecture, processes and testing.
- Define what your target recovery time and recovery point is.
- Be pessimistic and design for failures – (assume everything will fail and design a solution that is capable of handling it).
- Avoid single points of failures – all parts of your app should be highly available (different AZ / regions / cloud) – load balancers, app servers, web servers, message bus, database.
- Use monitoring and alerts for failover processes and for every change in state.
- Document your DR operational processes and automations.
- Try to “break” different parts in your application. From unplugging the network, to turning off machines…then try it again.
Netflix didn't just provide its share of advice; it has also started to open source many of the tools it uses internally. The first of these is "Chaos Monkey," a tool designed to purposely cause failures in order to increase the resiliency of applications running on Amazon Web Services (AWS).
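To make the idea concrete, here is a minimal sketch of the technique behind Chaos Monkey, not the actual tool: randomly terminate one running instance from a pool so that failover paths get exercised regularly. The `Instance` class here is a stand-in; a real implementation would call the cloud provider's API instead.

```python
import random

class Instance:
    """A stand-in for a cloud instance; a real tool would wrap the provider API."""
    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.running = True

    def terminate(self):
        self.running = False

def chaos_strike(instances, rng=random):
    """Randomly kill one running instance so failure handling gets tested."""
    running = [i for i in instances if i.running]
    if not running:
        return None
    victim = rng.choice(running)
    victim.terminate()
    return victim

pool = [Instance(f"i-{n:04d}") for n in range(5)]
victim = chaos_strike(pool)
print(f"terminated {victim.instance_id}; "
      f"{sum(i.running for i in pool)} instances still running")
```

The point of running this on a schedule, as Netflix does, is that a system which survives constant small failures is far more likely to survive a large one.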
Designing your application to survive failure – It’s all about redundancy
Netflix provides an excellent toolset for surviving outages on Amazon on the operational level.
In this section, I wanted to zoom in on the design implications for our application.
The core principle for surviving failure is actually fairly simple and applies to any system, not just the cloud, whether it is an airplane, a missile, or a car. At the end of the day, it's all about redundancy. The degree of tolerance is often determined by how many alternate systems, or parts of the system, we have in our design, and how separate they are from one another. It is also determined by how fast we can detect the broken part in our system and make the switch.
In software terms, our system is commonly built out of two main groups of parts: the business logic and the data.
Making a redundant software application that can survive failure is often based on setting up clones for both of those parts of our system.
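The detect-and-switch part of the principle above can be sketched as a simple health-checked failover. The endpoint names below are made up for illustration; a real system would plug in an actual health check (a ping, a TCP connect, a status endpoint).

```python
def first_healthy(replicas, is_healthy):
    """Return the first replica that passes its health check, or None.
    Failover speed is bounded by how quickly is_healthy() detects a fault."""
    for replica in replicas:
        if is_healthy(replica):
            return replica
    return None

# Hypothetical endpoints: the primary is down, so traffic moves to a standby.
replicas = ["db-primary.example.com", "db-standby-1.example.com"]
down = {"db-primary.example.com"}
active = first_healthy(replicas, lambda r: r not in down)
print(f"routing traffic to {active}")
```

Note that the sketch only covers detection and switching; how separate the replicas are (different AZs, regions, or clouds) determines which failures the switch can actually survive.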
Cloning your data
There are two models for creating redundant data systems.
1. Use database replication
2. Use generic replication as in the case of CDN
For database replication, each database tends to have a different replication scheme. Amazon RDS is based on a read replica that can take over when the master node fails. More modern databases, such as Cassandra and MongoDB, tend to offer more flexibility and control in setting up replication. So the first choice you need to make when setting up your data redundancy is choosing the right database.
Quite often, database replication tends to be good enough within a certain geography but can be too fragile for geo-redundancy. A model that has proven itself for replicating data across a WAN is the CDN.
What's more, a CDN is not tied to a specific data source, and can therefore act as a generic service for replicating data from multiple sources that don't necessarily reside within a single database.
Having said that, a CDN is suited mostly to read data and doesn't ensure data consistency. For that purpose, we need a generic replication service that can plug into any data source and replicate it to other locations in a way that lets us control the replication route, latency, and consistency.
Cloning your application
To clone our application business logic, we need to ensure that all parts of our system run the exact same version of all of our software components. That includes not just the binaries, but also the configuration, the scripts that run our application, and, more importantly, all our post-deployment procedures, such as failover, scaling and monitoring, which must also be kept consistent.
Quite often, what makes cloning our business logic complex is that the information on how to run our application is scattered across many different sources, such as scripts, as well as the minds of the people who run these apps.
To make the job of cloning our application simpler, and thus more consistent, we need to capture all the information for running our apps in the same place.
Configuration management tools such as Chef and Puppet, and, in the case of Amazon, CloudFormation, can help in this regard.
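The common idea behind these tools can be shown with a tiny, hypothetical sketch: describe how to run the application declaratively in one place, and have every clone derive its install, start, and failure-handling steps from that single description instead of from scattered scripts and tribal knowledge. The recipe contents below are invented for illustration.

```python
# A hypothetical, declarative description of how to run the app. The point is
# that install, start, and post-deployment steps all live in one place, so
# every clone of the application is built from the same recipe.
APP_RECIPE = {
    "install": ["apt-get install -y tomcat"],
    "start": ["service tomcat start"],
    "on_failure": ["service tomcat restart"],
}

def plan(recipe, phase):
    """Return the commands a provisioning tool would run for a given phase."""
    return list(recipe.get(phase, []))

print(plan(APP_RECIPE, "install"))
```

Chef, Puppet, and CloudFormation each have their own richer format for this, but the design choice is the same: a single source of truth that makes clones consistent by construction.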
Making it simpler through Cloudify
To make the work of setting all this up simpler, we tried to bake all those patterns into ready-made, out-of-the-box recipes.
Cloudify recipes include:
- Database cluster recipes with support for MySQL, MongoDB, Cassandra, PostgreSQL...
- Integration with Chef and Puppet
- Automation of failover, scaling, and continuous maintenance of your application.
- Application recipes that allow you to capture all the aspects of running your application, including post-deployment aspects such as failover, scaling and monitoring.
Here is a good example of setting up a redundant web application across sites using Cloudify.
Cloud brings lots of promise for making our business more agile.
Cloud has also become a huge shared infrastructure in which every failure has a much more significant impact on our business worldwide.
The experience in the past year has taught us that even a robust cloud infrastructure such as Amazon can fail.
Through this experience we've learned that, rather than relying on the infrastructure to prevent failure, we need to design our systems to cope with failure and get used to it as a way of life.
Having said that, the investment required to build a robust application can be fairly large and not something that everyone can afford.
Using tools like Cloudify, Chef, and Puppet, or, if you're a pure Amazon shop, the Netflix tools, can help to greatly reduce this effort by pre-baking many of these patterns into recipes.
- Cloudifying High Availability
- AWS Outage - Moving from Multi-Availability-Zone to Multi-Cloud
- Lessons From The Heroku Amazon Outage
Follow Nati on Twitter!