In one of my recent posts I used a production line analogy to describe the applicability of methodologies like "Lean" taken from production line optimization to transactional systems.
As a reminder, here's a good explanation of Lean from Amit Rathore's blog:
Ultimately, it focuses on throughput (of whatever is being produced) by taking a strictly system-level view of things. In other words, it doesn’t focus on particular components of the value-stream, but on whether all the components of the chain are working as efficiently as possible, to generate as much overall value as possible.
And from another post by Amit:
In many cases, much more bang for the buck can be got by simply looking at an extended value-stream, as opposed to a localized one.
Transactional systems have many similarities to production lines. In both cases, we're dealing with pipeline optimization.
In this post I provide a more specific example that illustrates how to apply Lean/Agile methodologies to scale-out transactional systems. This new category of transactional systems is also referred to as Extreme Transaction Processing (XTP).
Transactional system example
We'll start by examining a typical pipeline of a transaction processing system.
The pipeline consists of the following steps:
- Step 1: Send business request to messaging system
- Step 2: Take the business request, typically through a Message-Driven Bean (MDB), and process it
- Step 3: Update the state of the transaction
- Step 4: Trigger an event that will start the next step in our transaction system
- Repeat 1-4
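The four steps above can be sketched as a toy, single-process pipeline. This is only an illustration: the `Pipeline` class, the request layout (`id`, `payload`) and the "normalize" processing are assumptions made for the example. In a real system the queue would be a messaging server and the state would live in a database.

```python
from collections import deque


class Pipeline:
    """Toy, single-process stand-in for the four-step pipeline."""

    def __init__(self):
        self.queue = deque()  # stands in for the messaging system
        self.state = {}       # stands in for the database
        self.events = []      # events that trigger the next stage

    def send(self, request):
        # Step 1: send the business request to the messaging system.
        self.queue.append(request)

    def process_next(self):
        # Step 2: take the request off the queue and process it
        # (here "processing" is just normalizing the payload).
        request = self.queue.popleft()
        result = request["payload"].strip().upper()
        # Step 3: update the state of the transaction.
        self.state[request["id"]] = result
        # Step 4: trigger an event that starts the next step.
        self.events.append(("processed", request["id"]))
        return result
```

Each of the three attributes is a separate tier in the system discussed here, so every step is a network hop rather than a local call.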
In a real-world system, the business requests might be market data feeds, monitored events in a management system, credit card verification requests, and so on. Processing a request might include parsing, validation, and/or normalization, which transform the raw feed into something more meaningful. The next step is usually some sort of "matching" and "processing", which is where the specific business logic comes in.
The full business transaction requires the completion of all these steps.
Adding reliability and consistency
Now let's examine what happens to this system when we add high-availability requirements:
- Step 1: Send business request to messaging system
- Step 2: Replicate the message to a back-up node or disk
- Step 3: Open an XA transaction
- Step 4: Take the business request (typically through MDB) and process it
- Step 5: Update the state of the transaction (under the same transaction)
- Step 6: Replicate the transaction state to maintain high-availability of the database
- Step 7: Trigger an event that will start the next step in the transaction
- Step 8: Commit the transaction (message sent to the transaction coordinator)
- Step 9: Transaction coordinator calls prepare on the messaging system and database
- Step 10: Transaction coordinator calls commit on the messaging system and database
- Repeat 1-10
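Steps 8 to 10 follow the two-phase commit pattern. The sketch below is a simplified illustration, not the XA API: `Resource` and `two_phase_commit` are hypothetical names, the participants are in-process objects, and rolling back every participant on a "no" vote is a simplification of what real coordinators do.

```python
class Resource:
    """Hypothetical XA-style participant (messaging system or database)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []  # in a real deployment each call is a network hop

    def prepare(self):
        self.log.append("prepare")
        return self.can_commit

    def commit(self):
        self.log.append("commit")

    def rollback(self):
        self.log.append("rollback")


def two_phase_commit(resources):
    """Return True if all participants committed, False if rolled back."""
    # Phase 1 (step 9): ask every participant to prepare.
    votes = [r.prepare() for r in resources]
    if all(votes):
        # Phase 2 (step 10): everyone voted yes, so commit everywhere.
        for r in resources:
            r.commit()
        return True
    # Any "no" vote aborts the whole transaction.
    for r in resources:
        r.rollback()
    return False
```

Calling `two_phase_commit([Resource("messaging"), Resource("database")])` makes two calls to each participant, which is where the extra round trips in the list above come from.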
The steps listed above constitute what Lean defines as an end-to-end system view. Looking at this view it becomes pretty clear that an attempt to optimize a specific tier (e.g., data, messaging) is only going to produce marginal value (the system is only as strong as its weakest link).
Another thing that we learn is that adding reliability requirements more than doubled the number of network hops in our system!
The reason for this is related to the tier-based approach. In our case, the tiers are the messaging system and the data-tier. Because each tier is independent of the other, it has its own high-availability model and its own configuration and setup. We need to "pay" the cost of reliability over and over again for each tier. As the tiers represent separate systems, we also need to add the overhead of external systems, such as a transaction coordinator, to ensure the consistency of the two separate systems.
Things become worse when we add scalability requirements.
In most systems, the messaging-tier and data-tier are implemented as centralized servers, and the same applies to the transaction coordinator. What that means is that these centralized points become a bottleneck when we try to scale the system. This bottleneck means that the more transactions we try to process in parallel, the longer it takes to complete each step mentioned above. Furthermore, in many cases the overhead is not linear, especially when disk I/O is involved.
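To get a feel for why the overhead is not linear, one common textbook model is the single-server M/M/1 queue, where the mean time a request spends in the system is 1 / (service rate - arrival rate). Applying this model to a centralized tier is my assumption for illustration only; it is not a measurement of any real messaging system or database, and `mean_latency` is a hypothetical helper.

```python
def mean_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue (rates in requests/second)."""
    if arrival_rate >= service_rate:
        # At or beyond saturation the queue grows without bound.
        raise ValueError("server is saturated")
    return 1.0 / (service_rate - arrival_rate)
```

Under this model, a server that can handle 100 requests per second has a mean latency of 20 ms at 50 requests per second and 100 ms at 90 requests per second: load grows by 80 percent, latency grows fivefold.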
It therefore becomes almost impossible to provision our system based on our scaling requirements before we build the entire system and test it. This leads to a huge risk in project management, and it is one of the main reasons many project managers are surprised to discover toward the end of the project that the PoC and testing they conducted earlier do not represent their *real* system behavior in terms of throughput and latency. In many cases they have to go back to the drawing board for a very painful optimization cycle, or even a complete re-design to meet their goals. At this point, many projects simply fail.
The limitations of scaling on a per-tier basis
Although some implementations of messaging and databases provide partitioning, trying to break those centralized implementations on a per-tier basis -- whether it’s the database or messaging system -- is going to be extremely hard and very complex to maintain. It also introduces another interesting problem: the affinity problem. If we break each tier into multiple units, how do we ensure that the transaction is routed to the "right" unit when it moves from one tier to the other? In many cases this leads to complexity in the code or to a requirement for yet another external system that will act as the coordinator for message routing.
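One way to picture the affinity problem is content-based routing: every tier derives the partition from the same routing key with the same function. The sketch below is an assumption-laden illustration (`partition_for` and `PartitionedTier` are hypothetical names), and it only works as long as both tiers agree on the key, the hash function and the partition count, which is exactly the coupling that is hard to maintain across independent products.

```python
import zlib


def partition_for(routing_key, partition_count):
    """Map a routing key to a partition index.

    crc32 is used because it is stable across processes, unlike
    Python's built-in hash() for strings.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % partition_count


class PartitionedTier:
    """Hypothetical tier (messaging or data) split into partitions."""

    def __init__(self, name, partition_count):
        self.name = name
        self.partitions = [[] for _ in range(partition_count)]

    def route(self, routing_key, item):
        # Store the item in the partition chosen by the routing key
        # and report which partition was used.
        index = partition_for(routing_key, len(self.partitions))
        self.partitions[index].append(item)
        return index
```

With `messaging = PartitionedTier("messaging", 4)` and `data = PartitionedTier("data", 4)`, routing the key `"order-42"` through both tiers lands on the same partition index in each.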
How can Lean and Agile methodologies help
The first step in solving a problem is identifying where the problem is. By taking the "system view" we get a better picture of where the problem may lie. We can now choose either to apply incremental optimizations or to eliminate the bottlenecks altogether.
Applying incremental optimizations
With incremental optimizations we identify which steps are slowest, and try to improve them. In the example above, the database would seem to be one of the more obvious bottlenecks. We can reduce the time it takes to complete that step by front-ending the database with a distributed cache, also known as an in-memory data grid (IMDG).
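As a rough sketch of front-ending the database, the example below puts a read-through, write-through cache in front of a slow store. It is a single-process stand-in, not a real IMDG (which would be distributed and partitioned), and `SlowDatabase` and `CachedStore` are hypothetical names.

```python
class SlowDatabase:
    """Stand-in for the database tier; counts how often it is read."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value


class CachedStore:
    """Read-through / write-through cache in front of the database."""

    def __init__(self, database):
        self.database = database
        self.cache = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key]        # served from memory
        value = self.database.get(key)    # cache miss: go to the database
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        self.database.put(key, value)     # keep the database authoritative
        self.cache[key] = value
```

Repeated reads of the same key hit the database only once; writes still pay the database cost, which is why this helps read-heavy steps more than write-heavy ones.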
We can also try to minimize the use of distributed transactions by making our code aware that the same transaction may be processed more than once after a failure, and ensuring that reprocessing it has no additional effect. This property is referred to as idempotency.
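A minimal sketch of an idempotent consumer, assuming every message carries a unique ID (the `IdempotentAccount` class and its fields are hypothetical): a redelivered message is detected and skipped instead of being applied twice.

```python
class IdempotentAccount:
    """Applies each message at most once, keyed by message ID."""

    def __init__(self):
        self.balance = 0
        self.seen = set()  # IDs of messages already applied

    def apply(self, message_id, amount):
        """Return True if the message was applied, False if a duplicate."""
        if message_id in self.seen:
            return False   # redelivery after a failure: ignore it
        self.balance += amount
        self.seen.add(message_id)
        return True
```

In a real system the balance and the set of seen IDs have to be updated atomically and survive a crash, which is a large part of what makes this approach hard to get right.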
The limitations of incremental optimizations
Such optimizations are valid and, in some cases, sufficient. But they are just a painkiller for the symptoms and don't solve the underlying problem, which is a result of the tier-based approach. We also need to remember that writing code that deals with idempotency is complex and error-prone, as noted in Johan Strandler's article on InfoQ, "New patterns and middleware architecture needed for true linear scalability?":
The lack of transactional atomicity between entities and the messaging introduced by this, causes new problems that lurks it's way all up to the business logic; message retries and processes that must be able to handle idempotence.
Achieving end-to-end scalability
Moving to a tier-less approach using Space-Based Architecture (SBA) will help us eliminate all of the bottlenecks described above and transform our transaction processing system into one that fits the definition of Gartner's Extreme Transaction Processing (XTP).
This pattern has the following principles:
- Implement a common cluster for the entire system, including messaging and data. A common cluster eliminates the need for redundancy when we introduce fail-over. It also removes the need for a transaction coordinator, as we are no longer dealing with coordination of two separate sub-systems (we also eliminate the need for idempotency).
- Remove disk I/O from the critical path of the transaction. The state of the messaging middleware and in-flight transactions will be stored purely in-memory. The system will use replication to keep a copy of the data in an alternate memory instance for hot fail-over. Synchronization with the back-end database will be a background process using reliable asynchronous replication. This way we guarantee the consistency and availability of the system purely in-memory, without depending on disk storage. This approach has a nice cost-benefit to it as well, as it reduces the need for expensive disks as part of the infrastructure.
- Collocate the business logic with the data. This reduces the network hops as well as the number of moving parts in the system.
- Partition. We split (or "shard") the transactions into self-sufficient units of work.
In SBA terminology, steps 1 and 2 above are handled by a Processing Unit. Processing units are the unit of scale and fail-over in the system.
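The principles above can be sketched as a toy processing unit that holds its own inbox and state in memory, keeps an in-memory replica, and queues database updates for a background flush. This is an illustration under stated assumptions, not the GigaSpaces API: `ProcessingUnit`, `Cluster`, the crc32-based routing, and updating the replica before returning are all choices made for the example.

```python
import zlib
from collections import deque


class ProcessingUnit:
    """Messaging, data and business logic collocated in one unit."""

    def __init__(self):
        self.inbox = deque()         # in-memory messaging
        self.state = {}              # in-memory data
        self.backup_state = {}       # in-memory replica for hot fail-over
        self.write_behind = deque()  # updates awaiting the database

    def submit(self, key, amount):
        self.inbox.append((key, amount))

    def process_next(self):
        # Message, state and logic are local: no network hop in between.
        key, amount = self.inbox.popleft()
        new_value = self.state.get(key, 0) + amount
        self.state[key] = new_value
        self.backup_state[key] = new_value          # replicate in memory
        self.write_behind.append((key, new_value))  # database sync comes later
        return new_value

    def flush_to_database(self, database):
        # Background step: disk I/O stays off the critical path.
        while self.write_behind:
            key, value = self.write_behind.popleft()
            database[key] = value


class Cluster:
    """Partitions work across self-sufficient processing units."""

    def __init__(self, unit_count):
        self.units = [ProcessingUnit() for _ in range(unit_count)]

    def unit_for(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.units)
        return self.units[index]
```

Because the message and the state it touches live in the same unit, there is one routing decision per transaction and no coordinator between separate systems.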
Seamless transition through middleware virtualization
Another aspect of SBA is the abstraction of the API from the underlying implementation. With SBA, the API is basically a façade that exposes certain application semantics (messaging/data). This provides the benefit of the tier-less in-memory approach, without fundamental changes to the existing programming model.
For example, we can still use JMS as the messaging API, POJOs for implementing the business logic, and DAO for abstracting the data model (more in The Missing Piece in Cloud Computing: Middleware Virtualization).
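The façade idea can be sketched with two thin wrappers over one shared in-memory store: one exposes queue semantics, the other DAO semantics. Everything here is hypothetical (`Space`, `QueueFacade`, `OrderDao` and their methods are names invented for the example, not the JMS or GigaSpaces APIs); the point is only that application code keeps talking to familiar interfaces.

```python
class Space:
    """Hypothetical in-memory store shared by both façades."""

    def __init__(self):
        self.entries = []  # (kind, key, value) tuples in arrival order

    def write(self, kind, key, value):
        self.entries.append((kind, key, value))

    def take(self, kind):
        # Remove and return the oldest entry of the given kind, if any.
        for i, entry in enumerate(self.entries):
            if entry[0] == kind:
                return self.entries.pop(i)
        return None

    def read(self, kind, key):
        # Return the most recently written value for (kind, key), if any.
        for entry_kind, entry_key, value in reversed(self.entries):
            if entry_kind == kind and entry_key == key:
                return value
        return None


class QueueFacade:
    """Messaging semantics (send/receive) on top of the space."""

    def __init__(self, space, name):
        self.space = space
        self.kind = ("queue", name)

    def send(self, body):
        self.space.write(self.kind, None, body)

    def receive(self):
        entry = self.space.take(self.kind)
        return entry[2] if entry else None


class OrderDao:
    """Data-access semantics (save/find) on top of the same space."""

    def __init__(self, space):
        self.space = space

    def save(self, order_id, order):
        self.space.write("order", order_id, order)

    def find(self, order_id):
        return self.space.read("order", order_id)
```

Business logic written against `QueueFacade` and `OrderDao` does not need to know that both are backed by the same store.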
Real-life example: from tier-based to tier-less in 4 weeks!
Toward the end of last year GigaSpaces engaged with a large wireless carrier that was about to launch a new campaign. The company was concerned about the scalability of their existing tier-based Order Management System due to a recent failure in an earlier launch. They heard about Space-Based Architecture and wanted to use GigaSpaces to implement this pattern. A major hurdle was that most of the code was already written for a classic tier-based J2EE model and they had only 4 weeks to integrate with GigaSpaces and go directly into production.
We established a tiger team that integrated GigaSpaces XAP with their existing application, and within two weeks completed development and started stress testing. The system successfully went live before the holiday shopping season and served an extremely successful campaign. The diagrams below show the before and after architecture of this particular application:
Figure 1: Before -- Tier-based order management system
Figure 2: After -- Scale-out architecture based on GigaSpaces
The end-to-end system view highlights some of the fundamental problems with Tier-Based Architecture (TBA). TBA leads to tier-based thinking: instead of looking at things from an end-to-end perspective, we look at them at the tier level. In retrospect, it's amazing to see how far this thinking has seeped: it defines how we evaluate products, how we test systems, how we build development teams, and even, to a degree, the organizational structure. It even created an entire ecosystem of products, such as performance monitoring tools, aimed at optimizing individual tiers. It took the manufacturing industry decades to realize that this is a major problem that requires a new way of thinking. Hopefully, the software industry will come to this realization much quicker. We are already seeing signs of a move away from tier-based thinking. A good example is the transition from the waterfall development approach to Agile and Lean development methodologies. Aligning application architecture with this development model is a natural next step.
The model is mature and there are already many production references using it, as indicated by the example I described earlier, and the testimonial below from Ashmit Bhattacharya, who runs development for Blackhawk Network, a Safeway subsidiary that is the largest gift card processor in the world:
We were introduced to GigaSpaces by our enterprise service partners, and they decided to come in and integrate with our existing solution as part of the POC. The solution was demonstrated in less than 3 weeks and we had a linearly scalable solution at the end of the exercise. ...
..The best part of the solution in our particular case was the manner in which the solution scaled horizontally. This took a tremendous burden off my architecture teams and we could focus more on functional development of our solution rather than work on the framework.
I'm hoping that the work that we're doing at GigaSpaces will help bring this much needed change faster.