Over on HighScalability.com, Todd Hoff posted one of the most comprehensive
articles on latency that I've read, titled Latency
is Everywhere and it Costs You Sales - How to Crush it. It covers almost
every aspect of latency, and is a must-read on the subject. Todd provides a
good explanation of how Space-Based
Architecture helps in reducing latency through collocation of tiers and by
utilizing memory to remove the I/O bottleneck:
The thinking is that the primary source of
latency in a system centers around
accessing disk. So skip the disk and keep everything in memory. Very logical.
As memory is an order of magnitude faster than disk it's hard to argue that
latency in such a system wouldn't plummet.
Latency is minimized because objects are kept in
memory and work requests are directed directly to the machine containing the
already in-memory object. The object implements the request behavior on the
same machine. There's no pulling data from a disk. There isn't even the hit of
accessing a cache server. And since all other object requests are also served
from in-memory objects we've minimized the Service Dependency Latency problem
as well.
In this post I wanted to summarize my take-aways from Todd’s article and add
some of my own thoughts based on my experience with GigaSpaces customers.
Sources of latency – is it the network or the software?
When discussing latency most people fall into one of two main camps: the "networking" camp and the "software architecture" camp. The former tends to think that the
impact of software on latency is negligible, especially when it comes to Web
applications.
Marc Abrams says "The bulk of this time is
the round trip delay, and only a tiny portion is delay at the server. This
implies that the bottleneck in accessing pages over the Internet is due to the
Internet itself, and not the server speed."
The "software architecture" camp tends to
believe that network latency is a given and there is little we can do about it.
The bulk of latency that we can control lies within the software/application
architecture. Dan Pritchett's Lessons for Managing Latency provides guidelines for an application architecture that addresses latency requirements through loosely-coupled components, asynchronous interfaces, horizontal scaling from the start, an active/active architecture, and by avoiding ACID and pessimistic transactions.
So is it the network or the software?
The simplest way to answer this question is to run a mockup test that removes
the impact of the software on latency from the equation.
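One way to run such a test, assuming a Java web stack, is to expose a trivial "echo" endpoint that performs no business logic, no database access and no serialization, and compare its response time (measured from a remote client) with that of a real business request measured from the same client. If the two numbers are close, the network dominates; if they diverge, the software path is the main contributor. The sketch below uses the JDK's built-in HTTP server; the endpoint name and port are placeholders.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// A trivial "no-op" endpoint: latency measured against it from a remote client is
// essentially network + container overhead, with no application logic involved.
public class EchoBenchmarkServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/echo", exchange -> {
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        // Compare the measured latency of /echo with that of a real business endpoint:
        // the gap between the two is the latency contributed by the software path.
    }
}
```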
Global optimization vs Local optimization
To better understand latency optimization we can use the analogy of a plant production line. We have a big pipeline of things that need to get done and we need to
look at each element in the pipeline to optimize our production latency. In an
earlier post - Moving
to Extreme Transactions Processing using Lean methodology - I discuss how
we can apply the same principles used in manufacturing line optimization, such
as the Lean methodology, in software systems. I tried to illustrate the
applicability of some of the core principles of lean methodology. Here’s a
recap:
In many cases, we can get more
bang for the buck by looking at an extended value-stream, as opposed to a
localized one. Local optimization means digging into the latency path in a
specific component in our system. With global optimization, however, we look at
the entire pipeline and optimize at that level.
An example of local
optimization would be looking at our messaging system and lowering the latency
of sending a message from point A to point B. An example of global optimization
would be looking at the end-to-end transaction. Processing a typical transaction involves sending messages through a messaging system, consuming it, and then updating the database. If we collocate the message queue with the data receiver, we can easily eliminate half of the network hops. Additionally, if we’ll use the same storage for messaging and data, we can avoid the 2-phase commit overhead. At the same time, we can analyze the user experience to see how many clicks it takes to perform a given
operation. By reducing the number of clicks we can reduce the perceived latency
much more than we can by reducing the time it takes to process each click. It's
easy to see how we can get better latency savings by taking the global
optimization view; it usually offers much more room for improvement than local
optimization does.
If our system is already designed with a scale-out
model, adding more machines and spreading the load is much simpler than trying
to apply local optimizations.
Scalability as a major source of latency
One topic that is often missing or less understood in many latency discussions
is the impact of scalability on latency.
Todd writes: "We put shards in parallel to increase capacity, but
request latency through the system remains the same". This statement
is a common fallacy. It assumes that each request is completely
independent of the others. In reality, however, if the application is not designed
with a scale-out/share-nothing approach, then at some point it will hit contention
on a shared resource, which makes those supposedly parallel requests dependent on one another.
Contention happens when multiple concurrent users or business-requests hit
a shared resource at the same time. A shared resource might be a hardware
resource -- such as CPU, memory or disk -- or a software resource, such as a shared
database lock. Shared resources need to be freed before another request can be
processed through them. This contention time is proportional to the number of
concurrent attempts to consume the shared resource and the duration in which
the resource is locked. This is one of the basic consequences of Amdahl's Law,
which shows that if a request spends 10% of its time on a shared lock, even a
100x increase in CPU power cannot yield more than a 10x improvement in capacity.
This contention time, therefore, must be added to our "latency path". In a non-scalable
system it will be proportional to the number of concurrent requests, meaning
it will rapidly lengthen as the system load increases, up to the point at which
the system faces "resource starvation" (see further discussion of this here).
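To make the arithmetic concrete, here is a small sketch of Amdahl's Law applied to the 10% example above; it shows why adding CPU power alone cannot remove contention-induced latency.

```java
// Amdahl's Law: speedup(N) = 1 / (serialFraction + (1 - serialFraction) / N)
public class AmdahlExample {
    static double speedup(double serialFraction, int processors) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        double serial = 0.10; // 10% of each request is spent on a shared lock
        System.out.printf("  2 CPUs: %.2fx%n", speedup(serial, 2));   // ~1.82x
        System.out.printf(" 10 CPUs: %.2fx%n", speedup(serial, 10));  // ~5.26x
        System.out.printf("100 CPUs: %.2fx%n", speedup(serial, 100)); // ~9.17x
        // The upper bound is 1/0.10 = 10x, no matter how many CPUs we add.
    }
}
```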
Based on my experience, hardware and software contention is one of the
main contributors to latency. This is partly due to the fact that network
overhead latency is relatively fixed, while application overhead latency is
variable. It is extremely complex to design a fully-optimized software
application.
Scalability happens to be one of those things that are often implemented in
a non-optimized manner and, as mentioned above, this adds to latency. The only way
we can reduce the scalability overhead on latency is by reducing the contention
points in our application. The typical method for reducing the contention is
through “sharding” (partitioning) the access to those resources using a
share-nothing approach. Disks are less concurrent than memory, and therefore,
removing the dependency on disk access in the critical path of the operation is
one of the keys to a latency reduction strategy. This is one of the key principles
of Space-Based Architecture.
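To illustrate the sharding idea, the sketch below routes each operation to one of several independent partitions by hashing the data key, so concurrent requests for different keys don't compete for the same lock. The class and method names are hypothetical and not a GigaSpaces API; this is only meant to show the share-nothing principle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical share-nothing partitioning: each partition owns its own map and its own
// locking scope, so contention is limited to requests that hit the same partition.
public class ShardedStore {
    private final List<ConcurrentHashMap<String, Object>> partitions = new ArrayList<>();

    public ShardedStore(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ConcurrentHashMap<>());
        }
    }

    // Route by hash of the key; requests for different keys spread across partitions.
    private ConcurrentHashMap<String, Object> partitionFor(String key) {
        int index = Math.floorMod(key.hashCode(), partitions.size());
        return partitions.get(index);
    }

    public void put(String key, Object value) { partitionFor(key).put(key, value); }

    public Object get(String key) { return partitionFor(key).get(key); }
}
```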
The impact of peak load provisioning on the latency cost
Another source of latency is related to provisioning.
Many web sites use static provisioning based on peak load. But how do we
measure peak load? With the introduction of social networks, and phenomena such
as the Digg Effect, it becomes extremely hard to predict peak loads, as user
traffic is subject to "viral behavior" leading to sudden spikes in traffic. The
further ahead we try to plan, the greater the chance of missing the target.
This leads to one of two outcomes: 1) Over-provisioning – latency is not harmed,
but we pay for more servers and other resources than we normally need.
2) Under-provisioning – our site may significantly slow down or even crash.
Use on-demand scaling to smooth the latency peaks
If we can't predict the peak loads accurately, we need to scale the system rapidly
whenever we see it is approaching capacity. If the system was not designed for scale-out
(linear scalability), the process of scaling will involve a substantial amount
of work and tuning, which is time-consuming and therefore defeats the purpose.
A scale-out approach enables us to scale on demand and smooth out the impact
of load spikes on application latency by adding servers when load goes up and
removing unnecessary ones when load drops. In this way we can control latency
cost-effectively.
Cloud computing and virtualization enable us to build such an “elastic
computing” model with significantly less effort than previously necessary. For
example, the GigaSpaces Cloud Framework already supports on-demand scalability
for web containers.
My colleague Shay Hasidim posted a latency benchmark
that measured how low latency is maintained as the number of
servers increases.
From the results we see that as we increase the number of web servers, the
point at which contention kicks in (the scalability barrier) moves to a higher
number of concurrent users. With a single server, latency starts to increase at
around 100 concurrent users; with two servers, at 300; and with three
servers, at 500.
To ensure linear scalability on the web-tier, we must
ensure that the underlying data-tier scales out at the same rate, as can be
seen in the diagram below. In this case we used the GigaSpaces In-Memory Data Grid
as a front-end to a MySQL database.
![Web benchmark results](https://natishalom.typepad.com/.a/6a00d835457b7453ef01053622fdea970b-800wi)
From the graph above we see that the IMDG scales very close to theoretical
linear scalability. These results were achieved with an IMDG running on two
partitions; better scalability can be achieved by increasing the number of
partitions.
Read more about how to scale-out the data-tier in Scaling-out
MySQL.
To enable this level of on-demand scalability we used our new Cloud
Framework, which combines the GigaSpaces SLA-driven container as the
application deployment virtualization layer, Amazon EC2 as the machine level
virtualization layer, and the GigaSpaces application server as the middleware
virtualization layer. This way we can provision new machines as soon as
the SLA on the web-tier is breached (measuring
latency, in this specific case). When such an event happens we launch new
machine instances on EC2. A new web container is provisioned on these machines
through the GigaSpaces SLA-driven deployment system. An Apache load-balancer
agent is responsible for synchronizing the load-balancer whenever a new web container
joins the cluster. Using this approach we can achieve end-to-end dynamic scalability,
starting from the load-balancer, through the web-tier and business-tier, and ending
with the data-tier.
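The flow described above can be sketched, in simplified form, as a monitoring loop that reacts to an SLA breach. The interfaces below are hypothetical placeholders standing in for the EC2 provisioning, container deployment and load-balancer pieces, not the actual GigaSpaces or Amazon APIs.

```java
// Hypothetical interfaces standing in for machine provisioning, container deployment
// and load-balancer registration.
interface CloudProvider { String launchInstance(); }
interface ContainerDeployer { void deployWebContainer(String host); }
interface LoadBalancer { void register(String host); }

public class LatencySlaScaler {
    private final CloudProvider cloud;
    private final ContainerDeployer deployer;
    private final LoadBalancer loadBalancer;
    private final long latencySlaMillis;

    public LatencySlaScaler(CloudProvider cloud, ContainerDeployer deployer,
                            LoadBalancer loadBalancer, long latencySlaMillis) {
        this.cloud = cloud;
        this.deployer = deployer;
        this.loadBalancer = loadBalancer;
        this.latencySlaMillis = latencySlaMillis;
    }

    // Called periodically with the measured latency of the web tier.
    public void onLatencySample(long measuredLatencyMillis) {
        if (measuredLatencyMillis > latencySlaMillis) {
            String host = cloud.launchInstance();   // 1. start a new machine
            deployer.deployWebContainer(host);      // 2. provision a web container on it
            loadBalancer.register(host);            // 3. add it to the load-balancer pool
        }
    }
}
```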
It is important to note that while this test was performed on EC2, there is
nothing that binds the solution specifically to the EC2 environment. In fact,
we used the exact same model to enable dynamic scaling on private clouds using
GigaSpaces and the Sun Grid Engine, for example. A more detailed description of
that is available here.
Data Query latency
Query latency is the time it takes to process a query request and receive
the result. There are a few factors that influence query latency:
- The time it takes to access
the data (reading it from a file if it is stored on disk)
- Contention - the time spent
on a shared lock to access the data
- Complexity of the query - the
number of calls involved in executing the query
We can address each of those issues as follows:
- File systems are not
optimized for concurrent access. In addition, file systems are stream-based
systems that enforce serialization and de-serialization of the data every
time we wish to access it. An easy way to eliminate this overhead is to
put the data in memory, which enables access to it using a direct
reference. (See further details in InfoQ
Article - RAM is the new disk).
- We can reduce contention by
partitioning the data, which also results in partitioning the lock.
Putting the data in-memory also reduces contention because memory is much
more concurrent than disk, and we don't need to inherit global file system
locks. Instead, each data item can have its own locking. This will enable
much more concurrent access to our data.
- Collocating the business
logic with the data - We can reduce the number of remote calls required
for each query by using a stored-procedure-like approach, meaning the business
logic runs collocated with the data. For partitioned data we will need to
use a MapReduce-like pattern to execute the query across the distributed
data sources (see the sketch after this list). The fact that our data source
is now partitioned reduces query time compared with running the same query
against a centralized database, for the following reasons:
- The data-set per partition is smaller; and
- We can leverage the full capacity of the CPU/memory of each partition to get more power to process the query.
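To illustrate the Map/Reduce point above, here is a minimal scatter/gather sketch: the query runs in parallel against each partition (collocated with its own, smaller data set), and the partial results are merged on the caller's side. The Partition interface is a hypothetical stand-in for whatever partitioned data source is in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical partition abstraction: executes a query against its local, in-memory data.
interface Partition<T> {
    List<T> executeQuery(String query);
}

public class ScatterGatherQuery<T> {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // "Map" phase: run the query on every partition in parallel.
    // "Reduce" phase: merge the partial results into a single list.
    public List<T> query(List<Partition<T>> partitions, String query)
            throws InterruptedException, ExecutionException {
        List<Future<List<T>>> futures = new ArrayList<>();
        for (Partition<T> partition : partitions) {
            futures.add(pool.submit(() -> partition.executeQuery(query)));
        }
        List<T> merged = new ArrayList<>();
        for (Future<List<T>> future : futures) {
            merged.addAll(future.get()); // each partition only scanned its smaller data set
        }
        return merged;
    }
}
```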
Garbage collection impact on latency
Another source of latency that I found missing in Todd's article is the impact
of garbage collection. Garbage collection is part of any Java or .NET
application. It runs as a background activity that cleans up all the unused
objects in the JVM. In early versions of Java, the garbage collector paused the
entire application while it cleaned those unused objects. This hiccup time
depends on the size of the heap, the number of CPUs and the number of objects
freed between GC cycles. In those early versions of Java it was common practice
to use object pooling to reduce this hiccup: pooling bypasses most of the GC
work, but it forces us to manage object lifecycles in our own code, and the
pools themselves become a shared resource and a source of contention.

As of Java 5 the GC algorithms were improved to enable more concurrent garbage
collection. This means the hiccup time is spread out over time and therefore
has less impact on our application's peak performance. This holds true as long
as we have enough CPU cycles to spare for the GC. The caveat is that if our
application consumes 100% of the CPU, none of this optimization helps: when GC
kicks in it competes with our application for CPU time, and the end result is
long hiccups again. Real-time JVMs aim to address this problem by dividing CPU
cycles between the application threads and the GC threads in a deterministic
manner, i.e., they will slow down our application in some cases to ensure that
the GC gets enough cycles, trading throughput for more predictable latency
behavior.
The JVM comes with various switches that give better control over GC behavior
and provide means to tune GC time. One thing to note about GC optimization is
that it tends to be close to a voodoo art: a setting works in certain scenarios
and breaks in others, so it is very hard to find the right combination.
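As a starting point for that kind of tuning on the HotSpot JVMs of that generation, the flags below enable the concurrent (CMS) collector and turn on GC logging so the hiccups can actually be measured before anything else is changed. The heap sizes and jar name are placeholders; treat this as an illustration, not a recommended configuration.

```
# Enable the concurrent collector and log GC pauses so hiccups can be measured.
# Heap sizes and the jar name are placeholders -- size them for your own workload.
java -Xms2g -Xmx2g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar my-application.jar
```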
Avoiding GC hiccups - avoid over-utilization
Based on my experience, the simplest and most effective way to avoid GC
hiccups is to avoid over-utilization. This means that we need to plan our
system so that it doesn't consume more than 80% of the CPU and memory under
peak load (you can choose a different threshold that may be more appropriate
for your organization). For example, if each server can process up to 500
requests per second at 100% utilization, and we have a requirement to process
1000 requests/sec, it is better to provision three machines, each processing
roughly 330 requests/sec, rather than two machines maxed out to handle the same
1000 requests/sec. We also need to make sure that we have the right proportion
between memory and CPU. For intensive read/write applications, I would go with at
least 1 CPU per 2GB of memory, and if possible, 1 CPU per 1GB. These rules of thumb
should get you most of what you need in terms of latency. Obviously, if that's not
enough, then you need to dig deeper into the GC flags or consider a real-time JVM,
but you should use those options as a last resort.
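The capacity arithmetic above can be captured in a couple of lines; the sketch below computes how many machines are needed for a target load when we refuse to run each machine above a chosen utilization ceiling.

```java
public class CapacityPlan {
    // How many machines do we need if each handles `perMachineCapacity` requests/sec
    // at 100% utilization, but we want to stay below `maxUtilization` at peak load?
    static int machinesNeeded(double targetLoad, double perMachineCapacity, double maxUtilization) {
        return (int) Math.ceil(targetLoad / (perMachineCapacity * maxUtilization));
    }

    public static void main(String[] args) {
        // 1000 req/sec target, 500 req/sec per machine, 80% utilization ceiling -> 3 machines,
        // each serving roughly 330 req/sec at about 66% utilization.
        System.out.println(machinesNeeded(1000, 500, 0.8));
    }
}
```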
A note on 64-bit VM provisioning:
Lately I've seen some cases where people
thought they could use 64-bit machines and large memory heap sizes to reduce
the cost of the system (mostly due to software license costs and machine maintenance
fees). The assumption was that since with 64-bit each process can manage more
memory, they could use fewer machines and fewer processors. What they didn't take
into account are the considerations I presented above. The number of CPUs needs
to be proportional to the amount of memory, and not just to the number of VM
processes running the IMDG. This means that using 64-bit VMs can reduce the number
of machines, but might have almost no impact on the number of CPUs the system
will leverage. As for the amount of memory that each process can handle - that
number tends to vary widely, so I don't feel comfortable giving a concrete
number other than to say that I know of systems using the GigaSpaces product
with 8GB per process.
The cost of latency
Everything we do has a $ value associated with it. Latency is no different. Todd
mentioned in his post some of the issues related to latency cost.
The cost associated with losing users due to a bad user experience –
this measurement is typical for e-commerce, social networking and search
engine sites: "Amazon found every 100ms of latency cost them 1% in sales.
Google found an extra .5 seconds in search page generation time dropped traffic
by 20%."
The cost associated with losing trades – in this case the cost is
a measure of the chance of losing business when your competitor can trade
faster than you do: “A broker could lose $4 million in revenues per millisecond
if his electronic trading platform is 5 milliseconds behind the competition.”
Another aspect that was not mentioned is the operational cost of achieving
latency targets. This cost factor applies to latency in the same way it applies
to other scalability operational costs.
The cost of over-provisioning – if the system was not
designed for on-demand scaling then we are probably spending money on
over-provisioning, meaning the system is statically provisioned with more
machines than we actually need on average, and we pay the cost of
under-utilization (idle resources waiting for peak loads).
The cost of failure – if the system was under-provisioned, then we are
likely to face the cost of downtime. According to a Forrester survey of
235 organizations, 33% estimate the hourly cost of downtime at $10k-$100k,
25% at $100k-$500k, and 13% at $500k-$1M.
How cloud computing can help improve latency and save some of the latency cost
- Built for on-demand
scalability – cloud computing is a great enabling infrastructure built
for on-demand scaling.
- Geographically
distributed – we can improve latency by running our servers close
to the geographical location of the user. Quoting Todd's article again: "Facebook
opened a new datacenter on the east coast in order to save 70 milliseconds."
We can now have data centers spread around the globe at our disposal making
it easy to run our applications in those different data center locations and
point the user to the closest location at a fraction of the cost.
It is true that all this can be achieved without cloud
computing, but cloud computing reduces the barrier to entry so that even the
smallest startup can apply these optimizations, which were previously considered
a luxury only big companies could afford.
My 20/80 rules for achieving predictable latency
I'm sure that many readers are aware of the fact that out of the many
possible sources of latency, there are some that are beyond our control: Internet
routers, for example. One of the key questions I ask myself in relation
to latency is whether there is a 20/80 rule: what are the 20% of the
things I should focus on that will help me reduce 80% of the latency? In this
section I'll try to provide the guidelines I use for designing a system for
optimum latency.
- Focus on application
architecture and leave hardware and OS optimizations as a last
resort. The performance provided by commodity hardware should be good
enough for 80% of cases. In addition, the effort of optimizing hardware
and Internet routers might involve a huge investment, and therefore,
should be used sparingly. If you’re not sure whether the source of latency
in your application is the network or the software, run the tests I
mentioned above.
- Start with global
optimizations – before you begin to optimize the database, the
router and the messaging system, look at the entire pipeline of your
business request. By looking at that global level, you may find that parallelizing
some part of the request, or changing some of the reliability/consistency
requirements, may have a much bigger impact than any local optimization.
- Use Space-Based
Architecture principles (even if you’re not using GigaSpaces) –
quoting Todd again:
- Co-location of the
tiers (logic, data, messaging, presentation) on the same physical machine
(but with a shared-nothing architecture so that there is minimal
communication between machines)
- Assemble/collocate
your application components based on the runtime/execution flow
dependency and not based on their function in the system. For example, if
each request needs to go through steps such as parsing, validation,
matching and execution, it doesn't make sense to run each of those steps
in a separate process/tier. Instead, you can make sure all of those steps
are collocated and split the application into multiple units, each
containing all those components. In SBA we refer to those units as
processing units. This is probably one of the main differences between
Space-Based Architecture and tier-based architecture: in the tier-based
approach our application is broken down into a presentation tier,
business-logic tier and data tier, whereas in SBA we collocate all of
those tiers as much as possible and split the application into multiple
horizontal units, each containing all the tiers, to reduce the number of
moving parts and network hops.
- Co-location of
services on the same machine
- Maintaining data in
memory (caching)
- Asynch communication
to a persistent store and across geographical locations
- Avoid calling any
disk/database operation in the critical path of the execution. With the
addition of a data grid we can use in-memory data as the system of record.
This enables us to avoid database or file access during the critical path of
the user request and delegate the update to the database as an
asynchronous operation.
- You can add to
Todd's summary the other pieces associated with query optimization that I
laid out above, such as the use of Map/Reduce and moving the logic to where
the data is located.
- Design your
system for dynamic scalability – dynamic scalability doesn't
necessarily mean that scaling needs to happen in real time. It means that
scaling can be done without changing code and that the scaling cycle is
short. In many real-life scenarios “short” could mean a day or even a
week.
- Provision correctly
- avoid over-utilization.
- Other tips for
optimizing the architecture:
- Decouple application components
- Use SOA and EDA to make your application granular enough so that you
can easily change the way you assemble the different components of your
system without code changes. This flexibility is important as it will
allow you to decide at different stages what your business logic pipeline
is going to look like, and to optimize later, for example by
collocating elements that have strong dependencies between them from a business
perspective.
- Abstract your
communication layer - abstracting the network layer enables latency reduction
when the components are collocated, and it also makes it easy to plug in
different transports without changing code (a minimal sketch follows this
list). Assume that in the future new protocols, transports and other
technologies will be introduced; by decoupling your code from the transport
you can easily plug them in when they become available.
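As a minimal sketch of that transport abstraction (the interface and implementations below are hypothetical, not a specific product API): application code talks only to a Transport interface, so whether a call crosses the network or stays in-process becomes a deployment decision rather than a code change.

```java
import java.util.function.Function;

// Hypothetical transport abstraction: application code depends only on this interface.
interface Transport {
    byte[] send(String destination, byte[] payload);
}

// Collocated deployment: the "remote" service lives in the same process, so the call
// is a direct method invocation with no serialization and no network hop.
class InProcessTransport implements Transport {
    private final Function<byte[], byte[]> localService;

    InProcessTransport(Function<byte[], byte[]> localService) {
        this.localService = localService;
    }

    @Override
    public byte[] send(String destination, byte[] payload) {
        return localService.apply(payload);
    }
}

// Distributed deployment: the same interface is backed by a real network protocol
// (TCP, HTTP, a messaging product, ...) that can be swapped without touching callers.
class TcpTransport implements Transport {
    @Override
    public byte[] send(String destination, byte[] payload) {
        // ... open a socket to `destination`, write the payload, read the response ...
        throw new UnsupportedOperationException("network transport omitted from this sketch");
    }
}
```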
In most cases, following these steps gets you most of what you need. It also
provides a good basis for eliminating many of the factors that make latency
optimization on other layers more difficult. For example, if we run the
business logic in a collocated in-process mode we isolate the impact on our
code from external factors such as routers. It also provides a good model for troubleshooting
and optimizing the system when latency goes wrong. Rather than dealing with
latency as a big-bang project, we can break it down into levels that let us
deal with the latency problem in a more gradual manner.