Designers of distributed systems are forced to deal with many complex issues such as high availability, scalability, security, performance and latency. We often address these issues with brute-force development work, thereby creating a complex development process. As a result, testability is often ignored, or at best treated as a low-priority concern.
Testing complexity grows dramatically when we add high availability and scalability to the picture, simply because it is very hard to set up an environment in which we can test our code and apply our changes. Consequently, high-availability and scalability tests are often performed at a late stage in the development cycle. Not surprisingly, we often find that scalability and high availability have a significant effect on our application's behavior, especially when it comes to performance and latency. Because these issues are only discovered late in the development process, projects get delayed. In many cases, we need to go back to the whiteboard and redesign our application based on the new findings, and the cycle starts all over again.
This cumbersome process directly affects application quality, stability and reliability. I've witnessed too many cases in which a simple bug brought an entire system down and led to a loss of thousands of dollars. The fact that many transactional systems today are online systems makes those failures extremely visible, sometimes even landing on the 6 o'clock news.
Testability should be just as important as the other "ilities" when designing a new distributed application.
How can we improve the testability of distributed applications?
I've listed below a set of key principles based on our experience at GigaSpaces, both in internal development and in working on customer implementations:
Start with the architecture
If the architecture is complex, everything else will be complex -- including testing, scaling and fail-over.
Measure scalability and fail-over as part of the early test environment
Our goal should be to test our application's behavior under fail-over and scaling events as early in the development cycle as possible. One way to achieve this is to run a small-scale version of the application on a single machine. By small scale I simply mean small volumes of data and few concurrent users. All other aspects of the application (clustering, high availability) should be the same as what we plan for the deployment environment.
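To make this concrete, here is a minimal sketch of what such a small-scale fail-over test might look like. The EmbeddedCluster helper is a hypothetical placeholder for whichever clustering API the application actually uses; the point is that the topology (partitions plus backups) matches what we plan for production, while the data volume stays small enough to run on a single developer machine.

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class SmallScaleFailoverTest {

    @Test
    public void dataSurvivesPrimaryFailure() throws Exception {
        // Hypothetical helper: starts the same topology planned for production
        // (two partitions, one backup each), but in-process on a single machine.
        EmbeddedCluster cluster = EmbeddedCluster.start(2, 1);
        try {
            // Small data volume -- the scale is reduced, the topology is not.
            for (int i = 0; i < 1000; i++) {
                cluster.write("key-" + i, "value-" + i);
            }

            // Simulate a fail-over event early in the development cycle.
            cluster.killPrimary(0);

            // The backup should take over without losing data.
            assertEquals(1000, cluster.count());
            assertEquals("value-42", cluster.read("key-42"));
        } finally {
            cluster.shutdown();
        }
    }
}

Because the whole cluster lives inside the test process, the same test can later be pointed at a real multi-machine environment without changing its logic.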
Enable fast iterative development
The cycle of applying changes to our code and testing them in a production-like environment should be very short. As we witnessed with J2EE, if that cycle becomes too long we simply start to avoid it, and as mentioned earlier, problems then get identified late in the game and projects get delayed.
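One way to keep that cycle short is to make the production-like topology launchable straight from the IDE, with no packaging or deployment step. The sketch below reuses the hypothetical EmbeddedCluster helper from the previous example: change the code, hit run, and the clustered application is up again in seconds.

public class DevLauncher {

    // Hypothetical launcher: boots the same small-scale topology used by the
    // tests, so a code change can be recompiled and rerun from the IDE in
    // seconds instead of going through a full build-and-deploy cycle.
    public static void main(String[] args) throws Exception {
        EmbeddedCluster cluster = EmbeddedCluster.start(2, 1);
        System.out.println("Cluster is up -- run the client or attach a debugger.");
        Runtime.getRuntime().addShutdownHook(new Thread(cluster::shutdown));
    }
}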
Keep the configuration consistent
Developers and architects often assume that as long as the code stays the same, it is fine for the configuration to differ between the testing and production environments. As we all know, however, changing the configuration changes the way the application behaves. It is therefore quite common to hear "oh, this was a configuration issue..." after several days were wasted trying to figure out what went wrong.
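One practical way to enforce this is to keep a single cluster configuration shared by every environment and allow only the scale-related knobs to be overridden. The file name and property keys below are illustrative, not those of any particular product; the idea is simply that topology and behavior come from one place, while instance count and data volume may shrink on a developer machine.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class ClusterConfig {

    // Loads the single cluster.properties file shared by test and production.
    // Topology, replication mode and timeouts always come from that file;
    // only the sizing keys may be overridden via system properties.
    public static Properties load() throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream("cluster.properties");
        try {
            props.load(in);
        } finally {
            in.close();
        }
        // Scale knobs: the only values allowed to differ per environment.
        override(props, "cluster.instances");   // e.g. 2 on a laptop, 20 in production
        override(props, "data.volume");         // e.g. 1,000 records locally, millions live
        return props;
    }

    private static void override(Properties props, String key) {
        String value = System.getProperty(key);
        if (value != null) {
            props.setProperty(key, value);
        }
    }
}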
In Part II of this series I'll discuss how we follow these principles and address these issues with the OpenSpaces framework, which we recently released to GA as part of GigaSpaces XAP 6.0 (download here).