Background
You don’t need to be an expert to realize that a failure of an eCommerce site during Black Friday or Cyber Monday is a disastrous event, leading to a huge loss in revenue and reputation for the retailer. As eCommerce’s share grows to more than 8% of total US retail sales this year, the impact of failure becomes more significant, not just for the site itself but for the overall economy. A study on the subject, compiled by Joyent and New Relic, showed that 86% of companies experienced one or more episodes of downtime last holiday season. At the same time, 58% of customers will not use a company’s site again after experiencing site errors.
Another study by Radware measured not just the impact of downtime on eCommerce sites, but also the impact of slowness - an even more common and less measured metric. According to this study, a one-second delay correlates to:
- 2.1% decrease in cart size
- 3.5-7% decrease in conversions
- 9-11% decrease in page views
- 8% increase in bounce rate
- 16% decrease in customer satisfaction
In addition, a 2.2-second slowdown equals a 7.7% hit to the conversion rate.
Meanwhile, KISSmetrics illustrated how page loads longer than three seconds lead to a 40% bounce rate.
Obviously, there is enough business incentive here to invest in handling both downtime and latency issues. Meanwhile, looking at typical retailer traffic during this season (source: Akamai), we notice that traffic spikes by at least 500%.
In this post, I will share our specific experience and lessons learned from the 2014 holiday season, which turned out to be very successful. I believe that the results below speak for themselves.
2014 Results
How We Achieved These Results
- Taking a preemptive approach, rather than reacting after a failure occurred, prevented failure in the first place.
  - Common Causes: Most failures are the result of misconfiguration or capacity planning guesswork.
  - Planning: Proactive tuning with continuous system monitoring produces predictable improvements at scale, eliminating risks from incorrect provisioning.
  - Knowledge & Experience: eCommerce applications are complex and built from many subsystems. In many cases, an eCommerce organization does not have expert skills in each of the subsystems. Having an expert in the room helps to bridge this gap and builds the capabilities of business operations.
  - Fast Feedback: When product-related issues were identified, we were able to provide the fastest path to protect the business and address concerns in a timely fashion.
To give you a bit more insight into this process, I’ve added a section to this post called Stories from the War Room, which illustrates a real-life incident and the action that was taken by our on-site engineer to resolve it.
- Using In-Memory Computing to buffer peak load access to shared data resources
  - Typical eCommerce systems have shared data resources for managing inventory, orders, and catalog information.
  - Putting the shared data resources in-memory provides faster and more efficient (parallel) access to this shared data.
  - Data is mirrored back into the database in batches. In this way, peak load transactions are buffered so that peak traffic does not crash the database back end.
  - The In-Memory Compute grid acts as a system of record. A failure in the underlying database can be absorbed without affecting online users while the database is restored to a working state.
  - Using a combination of In-Memory & SSD allows very large In-Memory data sets to be stored at a reasonable cost, while still ensuring fast recovery during failure.
- Self-Healing Systems recover from failure in real time
  - Failures are inevitable: Keeping a backup copy in-memory enables zero-downtime systems to service user traffic without interruption, even if something does go wrong.
  - Systems provisioned for failure handle failure by design.
  - Automatic failover and provisioning eliminate the need to overprovision (costly) resources in case of failure. Traditionally, it’s common for retailers to provision resources for the holiday season at 5 times the capacity of their non-holiday traffic infrastructure.
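To make the idea of mirroring data back into the database in batches more concrete, here is a minimal sketch of a write-behind buffer. It is not XAP's API: the `WriteBehindStore` class, its `flush_batch` callback, and the SKU keys are all hypothetical names chosen for illustration. Reads and writes are served from memory, repeated updates to the same key are coalesced, and the database only sees one batch per flush.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]


class WriteBehindStore:
    """Hypothetical in-memory store that mirrors changes to a database in batches."""

    def __init__(self, flush_batch: Callable[[List[Row]], None], batch_size: int = 100) -> None:
        self._data: Dict[str, int] = {}  # serves all online reads
        self._pending: "OrderedDict[str, int]" = OrderedDict()  # not yet in the database
        self._flush_batch = flush_batch  # assumed callback that writes one batch to the database
        self._batch_size = batch_size

    def put(self, key: str, value: int) -> None:
        self._data[key] = value
        # Repeated updates to one key collapse into a single pending row.
        self._pending[key] = value
        self._pending.move_to_end(key)

    def get(self, key: str) -> int:
        return self._data[key]

    def pending_count(self) -> int:
        return len(self._pending)

    def flush(self) -> int:
        """Write up to batch_size pending rows; return how many were written."""
        batch = list(self._pending.items())[: self._batch_size]
        if not batch:
            return 0
        # If the database is down this raises, and the rows stay pending in memory.
        self._flush_batch(batch)
        for key, _ in batch:
            del self._pending[key]
        return len(batch)


# Usage: four inventory updates arrive during a traffic spike.
db_batches: List[List[Row]] = []  # stands in for the real database
store = WriteBehindStore(flush_batch=db_batches.append, batch_size=100)
for quantity in (10, 9, 8):
    store.put("sku-1", quantity)
store.put("sku-2", 5)
written = store.flush()
print(written, db_batches)
```

In this run, the four updates reach the stand-in database as a single batch of two rows. If the flush callback raises, the in-memory copy keeps serving reads and the rows remain queued for the next attempt, which is the buffering behavior described above.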
Two Examples
In one case, a Top 100 online retailer used XAP to provide access to its catalog and inventory data and achieved its first zero-downtime holiday season in several years. As a result, this retailer delivered a vastly improved customer experience over previous years (achieving an 18% improvement in customer satisfaction ratings) and generated a 139% increase over 2013 holiday sales.
In another case, a Top 30 US online retailer logged a record-setting peak sales day of $44 million. This was especially notable because that same day the retailer experienced system performance issues caused by an automated hardware failover condition. Fortunately, the retailer’s XAP implementation began automatically relocating application components to standby resources, keeping apps running despite the complications. As a result, consumers continued to shop—and buy—with minimal disruption.
Stories from the War Room
I’ve picked two issues that we identified while our engineers were working on-site with one of our top eCommerce customers. I think these two cases provide useful insight into how a preemptive support strategy and a short feedback loop work:
Issue #1: Sudden slow client response time
The quote below was taken from the direct on-site report:
GC spikes are one of the common issues that we encounter when managing in-memory data clusters. Because GC competes for the same CPU resources that serve user transactions, it often leads to overall slowness of the system. Fairly quickly, this slowness can pile up into a huge backlog that can break the system in unexpected areas.
The resolution was to split the cluster into more data containers (GSCs in XAP terminology), which spreads the load better across the entire cluster. In addition, the overall capacity (memory and CPU) allocated to the cluster was increased to meet the growing demand.
The diagram below provides a view of one of the clusters at the time the issue occurred.
As can be seen, around 23:00 the system started to hit its high CPU mark as a result of GC spikes. The system was gradually rebalanced after a couple of hours without any downtime. The preemptive action that was taken to handle this incident prevented it from piling up and causing a complete system failure.
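As a rough illustration of why splitting a cluster into more data containers spreads the load, the sketch below hashes a set of made-up order keys across 4 and then 8 containers and compares how many keys each one holds. The key names, the container counts, and the CRC32-modulo routing are assumptions made for this example and are not how XAP routes data. Each container ends up holding less data, which is one plausible reason a split can ease the pressure on any single container.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def container_for(key: str, containers: int) -> int:
    """Route a key to a container with a stable hash (illustrative routing only)."""
    return zlib.crc32(key.encode("utf-8")) % containers


def load_per_container(keys: Iterable[str], containers: int) -> List[int]:
    """Count how many keys land on each container."""
    counts = Counter(container_for(key, containers) for key in keys)
    return [counts.get(index, 0) for index in range(containers)]


# Hypothetical workload: 10,000 order keys.
keys = [f"order-{i}" for i in range(10_000)]
before = load_per_container(keys, 4)  # original cluster size (assumed)
after = load_per_container(keys, 8)   # after splitting into more containers

print("busiest container before the split:", max(before))
print("busiest container after the split: ", max(after))
```

Doubling the container count here splits each original container's keys between two new ones, so the busiest container can never hold more than it did before.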
Issue #2: Connection Issues
In this incident, the increase in concurrent client activity during peak load resulted in a large number of network connections being opened at the same time. One of the nodes in the cluster was misconfigured with a low limit on the number of connections that could be open simultaneously. The resolution was to kill that faulty node and leverage the self-healing capability of XAP to force an immediate re-route of clients to the backup node while relocating the faulty node to another machine.
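The recovery pattern used here (kill the faulty primary, let clients fall through to the backup, and start a replacement elsewhere) can be sketched in a few lines. This is a simplified model under stated assumptions, not XAP's implementation: the `Node` and `Partition` classes, the host names, and the synchronous replication are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    host: str
    data: Dict[str, str] = field(default_factory=dict)
    alive: bool = True


class Partition:
    """Hypothetical primary/backup pair with spare machines for relocation."""

    def __init__(self, primary_host: str, backup_host: str, spare_hosts: List[str]) -> None:
        self.primary = Node(primary_host)
        self.backup: Optional[Node] = Node(backup_host)
        self._spares = list(spare_hosts)

    def write(self, key: str, value: str) -> None:
        self._ensure_primary()
        self.primary.data[key] = value
        if self.backup is not None:
            self.backup.data[key] = value  # keep the in-memory backup copy in sync

    def read(self, key: str) -> str:
        self._ensure_primary()
        return self.primary.data[key]

    def _ensure_primary(self) -> None:
        """Promote the backup if the primary is down, then re-create a backup."""
        if self.primary.alive:
            return
        if self.backup is None:
            raise RuntimeError("no backup available to promote")
        self.primary = self.backup  # clients are re-routed to the former backup
        self.backup = None
        if self._spares:
            # Relocate the lost replica to another machine, seeded from the new primary.
            self.backup = Node(self._spares.pop(0), dict(self.primary.data))


# Usage: an operator kills the misconfigured node on host-a.
partition = Partition("host-a", "host-b", ["host-c"])
partition.write("cart:42", "3 items")
partition.primary.alive = False
value = partition.read("cart:42")
print(value, partition.primary.host, partition.backup.host)
```

After the kill, the read is served by the former backup on `host-b`, and a fresh backup is seeded on `host-c`, so the shopper's cart survives the failure.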
Final Notes
Peak loads tend to stress a system in the areas where it is least expected, which makes the resulting problems hard to handle. Quite often, peak loads lead to unexpected downtime.
There are many cases in which this sort of peak load is known in advance, as is the case with Black Friday and Cyber Monday. Still, many eCommerce sites continue to experience downtime or slowness during such events, leading to a huge loss of revenue and reputation.
As a software vendor, we have often found ourselves involved in the early architecture discussion phases, which usually take place as a result of failure in the previous year. Despite the fact that we are brought in to solve these peak load performance problems, we are still called during a fire drill when those failures occur. Often, the cause of the failure was misconfiguration or a problem on another system that manifested itself as an issue in our product. Obviously, the experience of handling fire drills is never pleasant, neither for us nor for our customers, and that is something that we wanted to avoid as we approached the 2014 holiday season.
This year we decided, together with our customers, to take a more preemptive approach by putting an engineer on-site to escort the customer team during the event itself. This resulted in huge success, leading to 100% uptime. Both teams learned much from the experience; the customer learned even better how to operate our product and what to look for to ensure that the system is running properly. We learned much about how the customer is using our product and were able to shorten the feedback loop between the customer and our product and engineering teams.
With those lessons in hand, I feel that both we and our customers are much better equipped to handle 2015. I can’t wait to write about the lessons learned from Black Friday 2015.