When we think about cloud computing, the first thing that comes to mind is on-demand computing. Indeed, cloud computing and virtualization providers have made it easy to get a machine with nothing more than a credit card and an API call. But what happens after you get that pool of machines? How do you deploy the application onto these machines? How do you know how the application behaves? How can you react when something breaks or when the load on the system gets too high?
In the human world, many organizations face pretty much the same challenges. It is fairly easy to get a good infrastructure for business – office space, electricity, telephones, computers – but as a manager, it’s really difficult to place the right people in those offices, give them sufficient work but not so much that they’re totally overloaded, track what they’re doing, and react properly when something goes wrong. A recent and very successful family of methodologies for dealing with this challenge, which is quickly gaining popularity, goes by names such as Agile, Scrum, and Lean. The common ground of this family is that, unlike the centralized management approach, we work through short iterations and continuously correct the course of our actions based on actual progress. We rely heavily on individuals to take responsibility for their actions and be part of the decision-making process. We assume that those individuals will take their own corrective actions for things that fall under their direct responsibility. The program manager is responsible only for making sure that the course (and budget) stays on target, and for shifting resources or allocating new ones if needed.
Today’s Operations and Applications – Just Like Centralized Management
Chances are that you are aware of the Agile methodology and recognize that it solves many of the huge difficulties of traditional, centralized management. But at the same time, if you think about it, you are still running your applications and IT operations according to the defunct, centralized management paradigm. We’ve invested years in creating an artificial separation between what is known as “operations” and “applications”. In this way of thinking, “applications” is responsible for maintaining a certain business logic and “operations” is responsible for monitoring and maintaining the application. Operations management is responsible for monitoring the CPU, memory and every individual aspect of the network configuration, application configuration, and consumption of resources. The application is not really aware of any of this, and usually you rely on the operations guys to take corrective action, or at least give you a call and let you know, when something out of the ordinary happens.
This seems natural because it’s a pattern we’ve all become accustomed to. But just imagine managing a real-life project (with people) using this same approach. It would take the centralized management approach to frightening extremes. Here’s how it would work: there is the project manager (operations) and the employees (application), with absolutely no direct communication between them. Each employee has a set of sensors attached to her, which report on health conditions (heartbeat, blood pressure, and so on), and another set of sensors that report how quickly she is working. The project manager views all this data in a nice dashboard with graphs and all sorts of reports. Whenever there is an alert (for example, one of the employees is running a fever, or is working very slowly for some reason), the project manager troubleshoots the issue, works out how to solve it, and determines whether it affects the project.
That’s ridiculous, isn’t it!?
In real life, especially when working with Agile, employees are responsible for their output. If they don’t feel well, or are working slower than usual, only the individual employee can really know what the problem is – or whether there is a problem at all. They take a rest or figure out what else they can do about it. They only report back to the project manager if they really can’t deal with the issue themselves and it’s starting to affect their ability to meet the project goals. The manager doesn’t need to constantly monitor their health or throughput – the only thing worth monitoring is whether they can do their job. The reason is of less interest. More importantly, the manager relies on the employees to report when such an event happens, and to take corrective action themselves.
If that’s so obvious, why can’t we apply the same common sense to the way we run our software? Why can’t we make our applications aware of their condition and able to take some responsibility for themselves? This is important in any computing environment, but I would argue that it is especially important in a cloud or any other virtualized environment, in which the operations/application paradigm totally breaks down.
The Solution: Bringing Operational Awareness to your Applications
Like a real-life employee, the application needs to be responsible for its own health. It needs to report only on general status and any shortage of resources. The operations guys don’t need to continuously monitor the health of each individual application component. They should rely on the applications themselves to send the right warning signals before something breaks, and to suggest corrective actions in a language that the operations guys can understand. For example: “I need X amount of additional resources” or “I can release Y amount of resources after I've done my duty.” Deciding which specific task is assigned to each of the available resources should be the application’s responsibility – in any case, it probably knows best how to utilize these resources.
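To make the "I need X" / "I can release Y" idea a bit more tangible, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ResourceSignal` type, the `summarize` function, memory in MB as the unit, and the threshold values are all hypothetical and not part of any real product API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSignal:
    """A coarse message from the application to operations (hypothetical type)."""
    kind: str       # "need", "release", or "ok"
    amount_mb: int  # memory in MB; 0 when kind is "ok"


def summarize(used_mb: int, allocated_mb: int,
              high: float = 0.85, low: float = 0.40,
              target: float = 0.65) -> ResourceSignal:
    """Turn raw utilization into a request that operations can act on.

    The thresholds are illustrative assumptions, not recommendations.
    """
    if allocated_mb <= 0:
        raise ValueError("allocated_mb must be positive")
    utilization = used_mb / allocated_mb
    # Allocation that would bring utilization back to the target level.
    ideal_mb = round(used_mb / target)
    if utilization > high:
        return ResourceSignal("need", ideal_mb - allocated_mb)
    if utilization < low:
        return ResourceSignal("release", allocated_mb - ideal_mb)
    return ResourceSignal("ok", 0)


if __name__ == "__main__":
    print(summarize(900, 1000))  # running hot: asks for more
    print(summarize(200, 1000))  # mostly idle: offers to give some back
    print(summarize(600, 1000))  # nothing to report
```

The point of the design is that operations never sees the raw utilization numbers, only a request it can either grant or refuse.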
To make this a reality, you need to add operational awareness to your applications. By operational awareness I mean that the application needs to be written with the assumption that it can be asked to move at any point in time, and that it needs to take responsibility for events that can be solved at the application level, taking corrective action to meet its SLA. The manager should only be notified in cases where the application has exhausted its physical resources and there are no other available resources. For example, if one of the instances of the application runs out of memory or needs more CPU power, and there are spare resources that are not utilized by other instances within the application, it is the application’s responsibility to balance its resources within its own realm. And even when there are no available resources, the manager doesn’t need to bother with the fine-grained details of how to balance the individual application resources. Balancing these resources should be kept within the scope of the application itself. The manager should only bother with allocating the pool of resources among the various applications.
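As a rough illustration of balancing "within its own realm", the Python sketch below moves work from overloaded instances to instances with spare room, and reports a shortfall only when the application's own pool is exhausted. The `rebalance` function, the notion of abstract "work units", and the assumption that every instance has the same capacity are all mine, chosen to keep the example short.

```python
from typing import Dict, Tuple


def rebalance(load: Dict[str, int], capacity: int) -> Tuple[Dict[str, int], int]:
    """Spread work units across instances; return (new_load, shortfall).

    load maps an instance name to the work units currently on it.
    capacity is the per-instance limit (assumed equal for all instances).
    A shortfall above zero means the application has exhausted its own
    pool and should ask the manager for more resources.
    """
    new_load = dict(load)
    overloaded = [name for name in new_load if new_load[name] > capacity]
    for name in overloaded:
        excess = new_load[name] - capacity
        # Offer the excess to the other instances, emptiest first.
        for other in sorted(new_load, key=new_load.get):
            if excess == 0:
                break
            room = capacity - new_load[other]
            if other == name or room <= 0:
                continue
            moved = min(room, excess)
            new_load[other] += moved
            new_load[name] -= moved
            excess -= moved
    shortfall = sum(max(0, units - capacity) for units in new_load.values())
    return new_load, shortfall


if __name__ == "__main__":
    # Solved internally: no need to bother the manager.
    print(rebalance({"a": 12, "b": 3, "c": 5}, capacity=8))
    # Pool exhausted: the shortfall is what gets escalated.
    print(rebalance({"a": 12, "b": 8}, capacity=8))
```

Only the second return value ever needs to leave the application; how the units were shuffled stays an internal detail.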
Learning from Amazon EC2
A working example of this approach is the Amazon EC2 API. In my view, one of the revolutionary decisions that made EC2 so successful was that, rather than creating high-level tools for managing the data center, Amazon chose to provide an API and let application developers figure out how to run their applications. The API opened up a whole ecosystem of third-party solutions, and a whole new market. What’s interesting is that in EC2, when I want to auto-scale my application, I don’t call the operations guys to do that. There is a very slim chance that someone who doesn’t understand my application would be able to write such a policy without breaking the application. Most of the applications that run on EC2 just call the EC2 API when they want to spawn a new machine, manage their high availability, or auto-scale. The nice thing about the API is that it doesn’t prevent Amazon and others from coming up, at a later stage, with higher-level tools that are much more useful for the operations guys. The main difference is that the application takes more responsibility for ensuring that it meets its SLA, and the operations guys get the type of information that makes sense to them and, more importantly, that they can easily act upon.
The Need for an Application Cluster Management API
Amazon provides a comprehensive set of APIs that work at the virtual machine level. Going back to our real-life project analogy, Amazon provides us with the real estate: the office space, the network, the computers. What we don’t get is access to the individuals who do the actual work. For this reason, an API at the virtual machine level is often not granular enough: it doesn’t enable us to interact with the individual application components, manage and automate the application’s deployment lifecycle, or define its dependencies and distribution model. Other frameworks, such as Capistrano, have become popular, mostly for managing Ruby deployments, and this type of tool is gaining ground in other cloud deployment settings as well. Capistrano can address application deployment and basic lifecycle management, and provides a good foundation for automating those processes.
While the combination of the two (the EC2 API and a deployment tool such as Capistrano) is a good starting point, it lacks the notion of an application cluster. In an application cluster, you want more explicit semantic support for managing deployment topologies (partitioned, replicated). You need built-in support for defining primary and backup instances, and you need to be aware of the data affinity requirements of certain elements in the architecture. Without all this, enabling the application to rebalance its resources when a machine joins or leaves the network is close to impossible.
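To show what such cluster semantics might look like, here is a small Python sketch of a partitioned topology with one primary and one backup per partition. The function names (`partition_for`, `assign_partitions`, `fail_over`), the round-robin placement, and the CRC32-based routing are assumptions chosen for brevity; a real cluster API would be considerably richer.

```python
import zlib
from typing import Dict, List, Tuple

Layout = Dict[int, Tuple[str, str]]  # partition id -> (primary, backup)


def partition_for(key: str, partitions: int) -> int:
    """Data affinity: the same key is always routed to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def assign_partitions(machines: List[str], partitions: int) -> Layout:
    """Place each partition's primary and backup on different machines."""
    if len(machines) < 2:
        raise ValueError("need at least two machines to separate primary and backup")
    layout: Layout = {}
    for p in range(partitions):
        primary = machines[p % len(machines)]
        backup = machines[(p + 1) % len(machines)]
        layout[p] = (primary, backup)
    return layout


def fail_over(layout: Layout, failed: str, survivors: List[str]) -> Layout:
    """Rebuild the layout after the machine named `failed` leaves the cluster."""
    if len(survivors) < 2:
        raise ValueError("need at least two surviving machines")
    new_layout: Layout = {}
    for p, (primary, backup) in layout.items():
        if primary == failed:
            # Promote the backup, which already holds the partition's data.
            primary, backup = backup, failed
        if backup == failed:
            # Pick a surviving machine other than the primary for the new backup.
            candidates = [m for m in survivors if m != primary]
            backup = candidates[p % len(candidates)]
        new_layout[p] = (primary, backup)
    return new_layout


if __name__ == "__main__":
    layout = assign_partitions(["m1", "m2", "m3"], partitions=3)
    print(layout)
    print(fail_over(layout, failed="m1", survivors=["m2", "m3"]))
```

None of this can be expressed at the virtual machine level alone, because the machine API has no idea which machine holds which primary or backup.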
The Interactive Cloud
The combination of image-level provisioning and an application cluster management API leads to a new form of cloud deployment experience that I refer to as the interactive cloud.
The interactive cloud operates in much the same way as we manage human-led projects. The image API is equivalent to the physical resources that are critical to running our project, e.g. the office, the computer, and the network. Another part of the project is managing the individuals who run it. The lesson of Agile project management is that the right way to manage our teams is through a close, iterative process. This is where the application cluster management API comes into play. This API gives the application developer the ability to control and automate application deployment, and to continuously balance the way the application runs, to ensure that it meets its required SLA. It also provides a callback mechanism for when you run out of resources and for when some of the idle resources can be released. You can also easily generate an aggregated view for operations (without going down to the level of what each individual component did at each given point in time).
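The callback mechanism could look roughly like the following Python sketch. The `ClusterMonitor` class, its method names, and the idle threshold are hypothetical stand-ins that I made up for illustration; they do not describe any existing API.

```python
from typing import Callable, List

Callback = Callable[[int], None]


class ClusterMonitor:
    """Hypothetical stand-in for an application cluster management API."""

    def __init__(self, capacity: int, idle_threshold: float = 0.3) -> None:
        self.capacity = capacity              # units the cluster can serve
        self.idle_threshold = idle_threshold  # assumed cut-off for "idle"
        self._on_exhausted: List[Callback] = []
        self._on_idle: List[Callback] = []

    def on_exhausted(self, callback: Callback) -> None:
        """Register a callback that receives the number of missing units."""
        self._on_exhausted.append(callback)

    def on_idle(self, callback: Callback) -> None:
        """Register a callback that receives the number of spare units."""
        self._on_idle.append(callback)

    def report_demand(self, demand: int) -> None:
        """Called by the application whenever it measures its demand."""
        if demand > self.capacity:
            for callback in self._on_exhausted:
                callback(demand - self.capacity)
        elif demand < self.capacity * self.idle_threshold:
            for callback in self._on_idle:
                callback(self.capacity - demand)


if __name__ == "__main__":
    monitor = ClusterMonitor(capacity=10)
    monitor.on_exhausted(lambda n: print(f"need {n} more units"))
    monitor.on_idle(lambda n: print(f"{n} units are spare"))
    monitor.report_demand(14)  # triggers the exhausted callback
    monitor.report_demand(2)   # triggers the idle callback
    monitor.report_demand(6)   # within bounds: no callback fires
```

In a real deployment the exhausted callback is where a call to the machine-level API (for example, starting another virtual machine) would naturally go, and the idle callback is where machines would be handed back.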
With this combination, you can spawn new applications, automate their deployment and lifecycle, start new machines as needed, and keep a close watch on your application's behavior after it has been deployed. When a new machine joins the cluster, the application automatically rebalances its resources to take advantage of that new machine, and when machines fail or are removed from the cluster, the application rebalances itself on the smaller resource pool. If something breaks, the application automatically scales down, allowing it to continue working despite the failure, and in the background allocates an alternate resource to take the place of the failed one.
IMO, a cloud/virtualized environment forces us to go in this direction. This is an area where the old way of separating “applications” from “operations” is just not going to cut it.
Final Words
There is nothing really new about the solution I propose here, and the concept behind it. Quite the contrary. The interactive cloud allows you to apply proven methodologies that work well for many critical projects in the real world, and apply them to the way you run your applications. Most people who run their application in the cloud have already started to build and manage their applications in this way, using tools like the Amazon EC2 API and Capistrano. However, this was mostly done as an afterthought and was mainly driven by intuition and common sense. My attempt in this post is to back up these initial efforts with a clear methodology.
In my next post (Adding operational awareness to your application), I will use the Administration and Monitoring API, new in GigaSpaces XAP 7.0, as an example of such a framework. It is worth noting at this point that this API is actually provided for free as part of our Community Edition, so anyone can simply download it and try out the examples I will provide in the next post. I’m looking forward to showing that this vision of the interactive cloud is really very tangible and is already at everyone’s fingertips.