Abstract:
Memcached is one of the most common in-memory cache implementations. It was originally developed by Danga Interactive for LiveJournal, but is now used by many other sites as a side cache to speed up read-mostly operations. It gained popularity in the non-Java world, too, especially since it’s a language-neutral side cache for which few alternatives existed.
As a side cache, memcached relies on the database as the system of record: the database is still used for write, update, and complex query operations. Since the memcached specification includes no query operations, memcached is not a database alternative, unlike most of the NoSQL offerings. This also excludes memcached from being a real solution for write scalability. As a result, many of the heavy sites have started to move away from memcached and replace it with NoSQL alternatives, as noted in a recent highscalability post, MySQL And Memcached: End Of An Era?
The transition away from memcached to NoSQL could represent a large investment, as many sites are already heavily invested in memcached usage. In this post, I'll illustrate an alternative approach in which we’ll extend the use of memcached for write scaling, add other goodies such as high availability and elasticity by plugging in GigaSpaces as the backend datastore, and avoid the need for a rewrite. The pure Java implementation could also be seen as a benefit, as it can increase the adoption of memcached within the Java community and leverage the portability of Java to other platforms.
Memcached overview
The diagram below shows a typical use of memcached. It outlines a simple deployment topology often referred to as a side cache. This topology is a very popular way to address read-mostly scenarios.
Typical use of Memcache for scaling read mostly applications
In this mode, read operations first check the memcached servers for an artificial key, derived from the requested data. If the key doesn’t exist on the memcached server, a query is made to the database, and the result is then stored under that artificial key in the memcached instances. Subsequent calls with that query are served by the memcached daemon and thus save a trip to the database. In the most common case, an update removes the data from the memcached instances and then writes the new data to both the database and memcached. Wikipedia has a code snippet that provides a more detailed example of this in action.
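The read and update paths described above can be sketched in a few lines of Java. This is a minimal illustration only: the class name SideCacheSketch is made up for this post, and the two in-memory maps are stand-ins for a real memcached client and a real database.

```java
import java.util.HashMap;
import java.util.Map;

/** Side-cache sketch; both stores are in-memory stand-ins (assumptions, not real clients). */
class SideCacheSketch {
    final Map<String, String> cache = new HashMap<>();     // stand-in for memcached
    final Map<String, String> database = new HashMap<>();  // stand-in for the system of record
    int databaseReads = 0;                                  // counts how often we hit the database

    /** Read path: try the cache first, fall back to the database on a miss. */
    String read(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                 // cache hit: no database access
        }
        databaseReads++;
        String value = database.get(key);  // cache miss: query the database
        if (value != null) {
            cache.put(key, value);         // populate the cache for subsequent reads
        }
        return value;
    }

    /** Update path: invalidate the cached entry, write to the database, refresh the cache. */
    void update(String key, String value) {
        cache.remove(key);
        database.put(key, value);
        cache.put(key, value);
    }

    public static void main(String[] args) {
        SideCacheSketch app = new SideCacheSketch();
        app.database.put("user:1", "alice");
        app.read("user:1");   // miss, loads from the database
        app.read("user:1");   // hit, served from the cache
        System.out.println("database reads: " + app.databaseReads);
    }
}
```

Note that every update still goes to the database, which is why this pattern helps with reads but does nothing for write scalability.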
The main benefit of this model lies in its simplicity. The memcached API is extremely simple to use for this specific scenario. The fact that it relies on a simple and open protocol, as opposed to a rich client bound to a specific language and implementation, makes it portable across a wide variety of languages and environments. It is also known to be fairly scalable for handling read operations, owing to its inherent shared-nothing model.
The advantages of memcached that make it so simple and popular are the exact same things that make it fairly limited for any scenario that goes beyond read-mostly side caching. Some of those limitations were the main drivers that forced some popular sites to switch to NoSQL as an alternative to memcached, as noted by Digg and Twitter.
From Looking to the future with Cassandra:
The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical.
From Cassandra @ Twitter: An Interview with Ryan King:
We have a lot of data, the growth factor in that data is huge and the rate of growth is accelerating. We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available.
Summarizing the points from Digg and Twitter, the main limitations of the current Memcached implementations are the lack of support for:
- Write scalability
- Elasticity
- High availability
I would also add to that list the lack of a consistency model, the limited query support, a client-based optional sharding model, management, and monitoring.
Marrying memcached and NoSQL
Many of the limitations of memcached outlined above are addressed by various NoSQL/In-Memory-Data-Grid implementations. The main difference is that many of the NoSQL alternatives were designed to act as a database and not just as a side cache. As such, they were designed with write scaling in mind and come with built-in elasticity and high availability.
The main motivation behind the integration between the two is to address the current limitations of memcached without forcing a complete and expensive re-write. Another motivation is that we can leverage the rich client and language support that comes along with the memcached protocol as a simple way to extend the use of NoSQL alternatives to other languages.
While the pattern of integrating memcached with an existing NoSQL backend is fairly generic, in this post I will refer specifically to the GigaSpaces integration.
GigaSpaces memcached support
Joseph Ottinger published a good summary of the rationale behind the GigaSpaces memcache support in a post on TheServerSide.com, “Did Someone Say GigaSpaces Now Has Memcached Support?” In this post, I'll try to provide a reference to the underlying GigaSpaces implementation in the context of the memcached usage pattern that I outlined above.
The GigaSpaces memcached support is fairly simple. It consists of a listener service (written in Java) that implements the memcached protocol for client applications, and maps the protocol into equivalent GigaSpaces operations. There are three modes of operation in which the integration works:
Memcached-compatible – reliability and write-scaling
The memcached-compatible mode is designed to work exactly the way memcached clients work today.
In this mode, each GigaSpaces node runs an embedded instance of the memcached service. The communication between the memcache client and the GigaSpaces node is handled through the memcached protocol. The GigaSpaces memcached service communicates with the GigaSpaces backend directly in-memory.
By using GigaSpaces as the backend datastore we gain immediate benefits from the robustness and richness of the GigaSpaces environment. This includes write scaling and reliability through the built-in database integration, which pre-loads data from a database and uses write-behind to store data back into the database. In addition, we gain built-in management, monitoring, and deployment automation (through the GigaSpaces dev-ops API). The fact that the entire stack is built in Java also lets us leverage the Java tool ecosystem (debugging tools, monitoring tools, profilers) as well as the portability of the Java Virtual Machine.
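To make the pre-load and write-behind idea concrete, here is a minimal Java sketch. It is not the GigaSpaces implementation or API: the class WriteBehindSketch and its methods are hypothetical, both stores are plain maps, and the flush is triggered explicitly, whereas a real write-behind would run asynchronously in the background.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Write-behind sketch: writes are acknowledged from memory and persisted later. */
class WriteBehindSketch {
    final Map<String, String> memory = new HashMap<>();    // stand-in for the in-memory store
    final Map<String, String> database = new HashMap<>();  // stand-in for the database
    private final Queue<String> dirtyKeys = new ArrayDeque<>();

    /** Pre-load: copy the existing database content into memory at startup. */
    void preload() {
        memory.putAll(database);
    }

    /** Write path: update memory only and remember that the key must be persisted. */
    void write(String key, String value) {
        memory.put(key, value);
        dirtyKeys.add(key);
    }

    /** Read path: always served from memory. */
    String read(String key) {
        return memory.get(key);
    }

    /** Persist all pending writes; returns the number of queued writes that were flushed. */
    int flush() {
        int flushed = 0;
        String key;
        while ((key = dirtyKeys.poll()) != null) {
            database.put(key, memory.get(key));
            flushed++;
        }
        return flushed;
    }

    public static void main(String[] args) {
        WriteBehindSketch store = new WriteBehindSketch();
        store.write("user:1", "alice");
        System.out.println("in database before flush: " + store.database.containsKey("user:1"));
        store.flush();
        System.out.println("in database after flush: " + store.database.containsKey("user:1"));
    }
}
```

Because the write returns as soon as memory is updated, the database is no longer on the critical path of each write, which is where the write-scaling benefit comes from.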
As the memcached protocol doesn’t include dynamic discovery, memcached clients are statically bound to the backend servers, so this mode does not benefit from GigaSpaces’ elasticity and dynamic scaling.
Memcached load-balancer – elasticity + write-scaling, reliability
In this mode of operation, the memcached server instance does not use the embedded GigaSpaces instance but instead references a remote GigaSpaces cluster. The GigaSpaces memcached server instance thus becomes a memcached load-balancer between the memcached clients and the GigaSpaces data partitions. Each GigaSpaces cluster could have many memcached load-balancers. As with HTTP load-balancers, the clients only need to point to a single IP address, and every memcached request is mapped to a GigaSpaces partition according to the key used in any given operation.
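The key-to-partition mapping can be pictured with a small Java sketch. The hash-modulo scheme below is an assumption chosen for illustration, and KeyRoutingSketch is a made-up name; the actual routing logic lives inside the GigaSpaces proxy.

```java
/** Key-based routing sketch: the same key always maps to the same partition. */
class KeyRoutingSketch {
    private final int partitionCount;

    KeyRoutingSketch(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    /** Maps a memcached key to a partition index in the range [0, partitionCount). */
    int partitionFor(String key) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        KeyRoutingSketch router = new KeyRoutingSketch(4);
        for (String key : new String[] {"user:1", "user:2", "session:abc"}) {
            System.out.println(key + " -> partition " + router.partitionFor(key));
        }
    }
}
```

The point of the sketch is that the client never needs to know the partition layout: it sends every request to the load-balancer address, and the mapping happens behind it.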
The main benefit of this mode is its simplicity and elasticity. Elasticity is achieved by the fact that the GigaSpaces cluster can grow as needed. The GigaSpaces proxy will discover new instances dynamically and route the memcached requests to the appropriate node. The same applies to fail-over, i.e. if one of the GigaSpaces nodes fails, the proxy will route the request to a hot backup automatically.
The memcached load-balancer can also be hosted as a service within the GigaSpaces container. To gain even better scalability of the load-balancer, we can combine the memcached-compatible mode (our first example) with this load-balancer mode, so that each node in the GigaSpaces cluster has the embedded memcached service configured as a load-balancer. In that configuration, a client can connect to any one of the nodes in the cluster and still gain access to the entire cluster.
The downside of this approach is performance, as each memcached request has to go through two network hops. For that reason we often configure a GigaSpaces LocalCache to avoid the extra hop for read operations. In this way, subsequent get operations on the same key are resolved within the load-balancer itself. Any update to that key through any of the partitions is automatically propagated to the local cache as well.
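The behavior of a local cache sitting inside the load-balancer can be sketched as follows. This is an illustration under stated assumptions, not the GigaSpaces LocalCache API: LocalCacheSketch is a hypothetical class, the remote partitions are a plain map, and update propagation is simulated by a direct method call.

```java
import java.util.HashMap;
import java.util.Map;

/** Local-cache sketch: repeated reads of a key skip the hop to the remote partitions. */
class LocalCacheSketch {
    final Map<String, String> remotePartitions = new HashMap<>();  // stand-in for the remote cluster
    private final Map<String, String> localCache = new HashMap<>();
    int remoteReads = 0;                                           // counts the extra network hops

    /** Read path: resolve locally when possible, otherwise fetch from the cluster once. */
    String get(String key) {
        String local = localCache.get(key);
        if (local != null) {
            return local;
        }
        remoteReads++;
        String value = remotePartitions.get(key);
        if (value != null) {
            localCache.put(key, value);
        }
        return value;
    }

    /** Simulates an update arriving at a partition and being pushed to the local cache. */
    void updateOnCluster(String key, String value) {
        remotePartitions.put(key, value);
        if (localCache.containsKey(key)) {
            localCache.put(key, value);   // keep the locally cached copy in sync
        }
    }

    public static void main(String[] args) {
        LocalCacheSketch balancer = new LocalCacheSketch();
        balancer.updateOnCluster("k", "v1");
        balancer.get("k");   // first read goes to the cluster
        balancer.get("k");   // second read is served locally
        System.out.println("remote reads: " + balancer.remoteReads);
    }
}
```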
Another potential use of that mode is as a gateway between two network domains. It is much easier to cross firewalls and network domains through a single gateway than to expose the entire cluster.
Memcached RPC – Map/Reduce support, multi-lang data feeder
One of the interesting and lesser-known uses of the memcached protocol is for Remote Procedure Calls or Map/Reduce operations. In this mode, a set operation on the memcached server is translated into a command, and a get operation is translated into the return value for that command. This is extremely useful for data-feeder and aggregated-query scenarios. Non-Java feeders can also leverage the GigaSpaces support for dynamic languages to pass in code that gets executed on the server, close to the data.
The execution of the code would be done through the use of the GigaSpaces polling container. In this case we would set the polling container to poll for the memcache entries.
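The set-as-command, get-as-result idea can be sketched in plain Java. Everything here is hypothetical and only illustrates the pattern: MemcachedRpcSketch is not a GigaSpaces class, the "feed" and "sum" commands are invented for the example, and the poll method merely stands in for what a polling container would do.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** RPC-over-memcached sketch: set submits a command, get fetches its result. */
class MemcachedRpcSketch {
    private final Map<String, String> entries = new HashMap<>();  // key -> command or result
    private final Queue<String> pendingKeys = new ArrayDeque<>(); // commands not yet executed
    private final Map<String, Integer> data = new HashMap<>();    // server-side data

    /** memcached-style set: the value is treated as a command to execute. */
    void set(String key, String command) {
        entries.put(key, command);
        pendingKeys.add(key);
    }

    /** memcached-style get: returns whatever is currently stored under the key. */
    String get(String key) {
        return entries.get(key);
    }

    /** Stand-in for the polling container: replace each pending command with its result. */
    void poll() {
        String key;
        while ((key = pendingKeys.poll()) != null) {
            entries.put(key, execute(entries.get(key)));
        }
    }

    /** Executes the made-up commands "feed <name> <int>" and "sum". */
    private String execute(String command) {
        String[] parts = command.split(" ");
        if (parts[0].equals("feed") && parts.length == 3) {
            data.put(parts[1], Integer.parseInt(parts[2]));
            return "ok";
        }
        if (parts[0].equals("sum")) {
            int total = 0;
            for (int value : data.values()) {
                total += value;
            }
            return String.valueOf(total);
        }
        return "error: unknown command";
    }

    public static void main(String[] args) {
        MemcachedRpcSketch server = new MemcachedRpcSketch();
        server.set("cmd:1", "feed a 2");
        server.set("cmd:2", "feed b 3");
        server.set("cmd:3", "sum");
        server.poll();
        System.out.println("sum result: " + server.get("cmd:3"));
    }
}
```

A feeder written in any language with a memcached client could drive this kind of exchange, since all it needs are the standard set and get calls.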
NOTE: We may be releasing a new version with built-in support for this specific pattern, so the API is still subject to change.
Simple Example:
The following example illustrates the first two modes outlined in this post – the memcached-compatible and the load-balancer modes.
For the sake of simplicity we will use a single memcached server instance. A more advanced deployment of a complete cluster is referenced at the end of this post.
Running a single memcache instance:
In this example we will run a GigaSpaces instance using the gs-memcached{.sh/.bat} utility. The gs-memcached utility takes one argument: the Space URL.
> gs-memcached <Space URL>
Note that the gs-memcached utility is available only since the GigaSpaces 8.0 release. For earlier versions, use the puInstance utility instead, in the following format:
> puInstance{.sh/.bat} -properties space embed://url="<Space URL>" "<gigaspaces root>\deploy\templates\memcached"
Running a single memcache in compatible mode:
To run a memcached server that references an embedded GigaSpaces instance, we will use a Space URL of the following form: “/./<name>”
In our example we will use “/./MyMemcached”, which starts a memcached server that points to an embedded GigaSpaces instance named “MyMemcached”.
> gs-memcached{.sh/.bat} "/./MyMemcached"
or simply
>gs-memcached{.sh/.bat}
which uses the default name “/./memcached” as the Space URL.
Running a single memcached load-balancer:
To run memcached in load-balancer mode, we reference a remote GigaSpaces cluster by setting the Space URL to that cluster, as outlined below:
> gs-memcached{.sh/.bat} "jini://*/*/remoteGigaSpaces"
A simple memcache client:
There is nothing really unique in our memcached client. We can use any standard memcached library and point it to the host:port of our server. The default port is 11211.
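import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.XMemcachedClient;

// Connect to the memcached server on the default port
MemcachedClient memClient = new XMemcachedClient("localhost", 11211);
memClient.set("1", 3600 * 60, "some value");  // the second argument is the expiry, in seconds
String value = memClient.get("1");            // read the value back
memClient.delete("1");                        // remove the entry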
Monitoring and Management:
One of the benefits of using GigaSpaces as the backend data store is that we can leverage the existing monitoring and management facilities to track the statistics and activity of our memcached daemon. We can use any of the GigaSpaces management tools for that purpose, including the rich client UI, the web-based UI, and the command-line and API-based management and deployment tools.
The diagram below shows the real time statistics gathered from the gs-ui utility.
Advanced setup:
In a more advanced setup, we would want to leverage the GigaSpaces Dev-Ops API and SLA-driven container to automate the deployment and manage its SLA. A detailed description of that mode is provided in the GigaSpaces memcached documentation.
Final words – QCon next week
We often tend to treat any new technology as a sort of new religion. Quite often, our needs change, and with that we tend to switch from one technology (religion) to another. Memcached and NoSQL are a good example of such a transition. I personally believe that to gain the benefit of a new technology we don’t need to completely abandon previous technologies, but should instead learn from them to make the new technologies and techniques better. That is particularly true when it comes to data technology - consider SQL, which has been around for more than four decades!
A new writeup that came out last week on GigaOM, “Will Scalable Data Stores Make NoSQL a Non-Starter?,” is just a reminder of how quickly technology tends to shift. That makes taking a more evolutionary path toward the revolution that we’re going through more important than ever.
During the QCon session Yes, SQL! next week, Uri Cohen will lay out some of the various NoSQL query languages that are available today, starting with memcached, MongoDB, Cassandra, and Redis, and will outline a model for leveraging the best of each in our existing JEE world. I hope it will lead to an interactive learning session and debate, as it often does at QCon.
References
- Memcached (Wikipedia)
- MySQL And Memcached: End Of An Era? (highscalability.com)
- Did Someone Say GigaSpaces Now Has Memcached Support? (TheServerSide.com)
- GigaSpaces memcache documentation (GigaSpaces.com)
- GigaSpaces FREE Community edition (GigaSpaces.com)
- YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World
- GigaSpaces dev-ops API (GigaSpaces.com)
- Will Scalable Data Stores Make NoSQL a Non-Starter? (GigaOM)
- Cassandra @ Twitter: An Interview with Ryan King (twitter.com)
- Looking to the future with Cassandra (Digg.com)