Eric Lai published a provocative article in Computerworld titled “No to SQL? Anti-database movement gains steam,” in which he points to a number of Internet-based companies that chose an alternative approach to the traditional SQL database. The write-up was driven by the inaugural get-together of the burgeoning NoSQL community, which seems to represent a growing anti-SQL database movement.
Quoting Jon Travis from this article:
Relational databases give you too much. They force you to twist your object data to fit a RDBMS [relational database management system].
The article points to specific examples that led companies such as Google, Amazon, and Facebook to choose an alternative approach. I have outlined below what I found to be the main drivers behind this trend:
- Demand for extremely large scale:
“BigTable, is used by local search engine Zvents Inc. to write 1 billion cells of data per day.”
- Complexity and cost of setting up database clusters (see the sharding sketch after this list):
“PC clusters can be easily and cheaply expanded without the complexity and cost of ‘sharding,’ which involves cutting up databases into multiple tables to run on large clusters or grids.”
- Compromising reliability for better performance:
“There are different scenarios where applications would be willing to compromise reliability for better performance. A good example for such a scenario is HTTP Session data. In such a scenario the data needs to be shared between various web servers but since the data is transient in nature (it goes away when the user logs off) there is no need to store it in persistent storage.”
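To make the “sharding” that the quote refers to a bit more concrete, here is a minimal sketch of application-level sharding over plain JDBC. The shard URLs and the users table are hypothetical; the point is that once the application owns the routing function, every query, schema change, and rebalancing step has to respect it, which is where much of the complexity and cost comes from.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Minimal illustration of application-level "sharding": the application,
// not the database, decides which physical database a row lives in.
public class UserShardRouter {

    // Hypothetical JDBC URLs, one per shard; in practice these would come
    // from configuration.
    private static final String[] SHARD_URLS = {
        "jdbc:mysql://db1.example.com/users",
        "jdbc:mysql://db2.example.com/users",
        "jdbc:mysql://db3.example.com/users"
    };

    // Pick a shard by hashing the key; every query for this user must use
    // the same function, which is exactly the coupling that makes manual
    // sharding costly to set up and to change later.
    private static int shardFor(long userId) {
        return (int) Math.floorMod(userId, (long) SHARD_URLS.length);
    }

    public static void saveUser(long userId, String name) throws SQLException {
        String url = SHARD_URLS[shardFor(userId)];
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO users (id, name) VALUES (?, ?)")) {
            ps.setLong(1, userId);
            ps.setString(2, name);
            ps.executeUpdate();
        }
    }
}
```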
Having said all that, it seems that many still agree that despite all the limitations of the traditional database solution, SQL databases are probably not going away:
“It's true that [NoSQL] aren't relevant right now to mainstream enterprises," Oskarsson said, "but that might change one to two years down the line.”
The current "one size fit it all" databases thinking was and is wrong
The article seems to point to an interesting trend: a growing number of application scenarios cannot be addressed with a traditional database approach. This realization is actually not that new. In 2007 I wrote a summary of Michael Stonebraker's article, "One size fits all: A concept whose time has come and gone," on my blog: Putting the database where it belongs. The great thing is that it looks like this “old” news is spreading to the larger community. This can be explained by the continuous growth of data volumes, together with the growing need to process larger amounts of data in a shorter time. These two trends force many users to think of an alternative approach to the traditional database. The classic early adopters are those who hit the wall. It is very likely that as these alternative solutions mature, they will find their way into mainstream development as well.
Not your mom and dad’s database
The article seems to over-glorify some of the alternatives that were mentioned while downplaying their limitations. A good example is Amazon SimpleDB. I wrote a post about this in the past, Amazon SimpleDB is not a Database, in which I outlined some of the limitations of the Amazon SimpleDB solution. As you can see from these limitations, SimpleDB cannot and shouldn't be positioned as a direct alternative to your existing database.
While I share many of the thoughts and much of the enthusiasm of the anti-SQL movement, I would highly recommend taking very cautious steps toward any of the alternative solutions. It is very important that you make yourself familiar with their strengths and weaknesses, and avoid hitting their limitations at a point in time when you have very little room to maneuver. I've seen several cases where users developed their data model against a centralized model and expected that it would scale seamlessly once they switched to a partitioned topology. The fact that you can switch between centralized and partitioned topologies without changing your code doesn't mean that your application will behave correctly and scale as you expect.
This topic was actually the center of a discussion at an Architect Summit meeting we had last summer, which was hosted by eBay:
Abstractions and Partition Awareness
A horizontally partitioned system typically provides an abstraction that makes the partitions appear as a single logical unit. eBay and Flickr, for example, both use a proxy layer to route requests by a key to the appropriate partition, and applications are unaware of the details of the partition scheme. There was near-universal agreement, however, that this abstraction cannot insulate the application from the reality that partitioning and distribution are involved. The spectrum of failures within a network is entirely different from failures within a single machine. The application needs to be made aware of latency, distributed failures, etc., so that it has enough information to make the correct context-specific decision about what to do. The fact that the system is distributed leaks through the abstraction.
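To illustrate how the distribution leaks through, here is a small sketch of a fan-out read where each partition answers, or fails, on its own schedule. The PartitionClient interface is a hypothetical stand-in for whatever proxy or driver your system provides; the interesting part is that the application itself has to decide what a timeout or a partial answer means.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of why partition awareness leaks through the abstraction: a
// "single" logical read fans out to several partitions, and each call can
// time out or fail independently.
public class PartitionAwareReader {

    // Hypothetical client for one partition.
    interface PartitionClient {
        String get(String key) throws Exception;
    }

    private final List<PartitionClient> partitions;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    PartitionAwareReader(List<PartitionClient> partitions) {
        this.partitions = partitions;
    }

    // Ask every partition for the key, but give each call its own deadline.
    public List<String> readFromAll(String key, long timeoutMillis)
            throws InterruptedException {
        List<Future<String>> futures = new ArrayList<>();
        for (PartitionClient p : partitions) {
            futures.add(pool.submit(() -> p.get(key)));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                results.add(f.get(timeoutMillis, TimeUnit.MILLISECONDS));
            } catch (TimeoutException | ExecutionException e) {
                // A failure mode a single machine never shows the application:
                // one partition is slow or down while the others answered.
                // The caller must decide whether a partial result is acceptable.
            }
        }
        return results;
    }
}
```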
My recommendation would be to design your data model to fit a partitioned environment even if, during the initial stage, you're still going to use a single centralized server. This will allow you to scale when you need to, without going through a massive change.
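As a minimal illustration of what designing for partitioning can mean in practice, the sketch below (class and field names are purely illustrative) gives every entry an explicit routing key so that related data collocates on the same partition; the same routing function works whether you run one server or a hundred.

```java
// A data model designed for partitioning from day one: every entry carries
// an explicit routing key, so that all of a customer's data lands on the
// same partition and the common operations never cross partitions. The
// routing works unchanged on a single centralized server, which is the
// point of the recommendation above.
public class Order {

    private final String orderId;
    private final String customerId; // routing key: partition by customer
    private final double amount;

    public Order(String orderId, String customerId, double amount) {
        this.orderId = orderId;
        this.customerId = customerId;
        this.amount = amount;
    }

    // The same function decides placement whether there is 1 partition or 100.
    public int routeTo(int partitionCount) {
        return Math.floorMod(customerId.hashCode(), partitionCount);
    }
}
```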
What about in-memory alternatives?
An option that I found missing from this article, and one that is becoming fairly popular with many large websites, is the use of an in-memory data store, or In-Memory Data Grid as they are often called, such as Memcached, GigaSpaces, Coherence, eXtremeScale, etc. With this model we front-end the database with an in-memory cluster, which becomes the system of record and uses the SQL database as the background persistent store. For those looking to build social network graphs, real-time events (as in Twitter), real-time analytics, fraud detection, session management, etc., that is probably the more natural choice. Todd Hoff from highscalability.com wrote a very good article on this subject: Are Cloud Based Memory Architectures the Next Big Thing? I also wrote a detailed description of how this approach works with GigaSpaces and MySQL: Scaling Out MySQL.
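A rough sketch of this front-ending model, assuming hypothetical DataGrid and Database interfaces rather than any particular product's API: reads are served from memory (read-through), and writes hit the in-memory store first and are persisted to the SQL database in the background (write-behind).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of fronting a SQL database with an in-memory store: reads are
// served from memory, writes go to memory first (the system of record) and
// are persisted to the database asynchronously. The DataGrid and Database
// interfaces are stand-ins for whatever product and JDBC code you use.
public class InMemoryFrontEnd {

    interface DataGrid { String get(String key); void put(String key, String value); }
    interface Database { String load(String key); void store(String key, String value); }

    private final DataGrid grid;
    private final Database db;
    private final ExecutorService writeBehind = Executors.newSingleThreadExecutor();

    InMemoryFrontEnd(DataGrid grid, Database db) {
        this.grid = grid;
        this.db = db;
    }

    // Read-through: serve from memory, fall back to the database on a miss.
    public String get(String key) {
        String value = grid.get(key);
        if (value == null) {
            value = db.load(key);
            if (value != null) grid.put(key, value);
        }
        return value;
    }

    // Write-behind: the grid is updated synchronously and becomes the system
    // of record; the SQL database is updated in the background.
    public void put(String key, String value) {
        grid.put(key, value);
        writeBehind.submit(() -> db.store(key, value));
    }
}
```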
What about ACID transactions, consistency, etc.?
The traditional 2PC (two-phase commit) model, in which consistency is achieved through a central transaction coordination server, is not going to fit many of the distributed data management alternatives. In an earlier post, “Lessons from Pat Helland: Life Beyond Distributed Transactions,” I described how Pat Helland suggested an alternative to distributed transactions: a workflow model. Instead of executing a long transaction and blocking until everything is committed, you break the operation into small individual steps, where each step can fail or succeed individually. By breaking the transaction into small steps, it is easier to ensure that each step can be resolved within a single partition, thus avoiding all the network overhead associated with coordinating a transaction across multiple partitions. This has been one of the core concepts in designing scalable applications with Space-Based Architecture (SBA). The Actor model popularized by functional languages like Scala and Erlang is built into the SBA model, with the difference that in SBA, actors can share state and pass events by reference, thus avoiding the overhead of copying the data with every transaction.
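As a rough sketch of that workflow idea (the interfaces and class names here are illustrative, not any product's API), an operation becomes a chain of small steps that each succeed or fail locally within a single partition, with compensation instead of a global rollback:

```java
import java.util.List;

// Sketch of the "workflow" alternative to a distributed transaction: instead
// of one 2PC transaction spanning partitions, the operation becomes a chain
// of small steps, each of which commits (or fails) locally within a single
// partition.
public class TransferWorkflow {

    interface Step { boolean execute(String orderId); }      // local, single-partition work
    interface Compensation { void undo(String orderId); }    // how to back out a completed step

    record WorkflowStep(Step step, Compensation compensation) {}

    private final List<WorkflowStep> steps;

    TransferWorkflow(List<WorkflowStep> steps) {
        this.steps = steps;
    }

    // Run steps one at a time. There is no global lock and no coordinator;
    // if step N fails, the already-completed steps are compensated instead
    // of rolled back by the database.
    public boolean run(String orderId) {
        for (int i = 0; i < steps.size(); i++) {
            if (!steps.get(i).step().execute(orderId)) {
                for (int j = i - 1; j >= 0; j--) {
                    steps.get(j).compensation().undo(orderId);
                }
                return false;
            }
        }
        return true;
    }
}
```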
Shay Banon wrote a good description of how the Actor model works with SBA in this post:
“The above is a diagram of a simple polling container that wraps a service (an Actor, the Order service in our example). The polling container takes (removes) events (Data), processes them, and writes Data back to the collocated Space (Data Grid) it is running with.”
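A stripped-down, product-agnostic sketch of that polling-container loop, using a plain BlockingQueue where the real implementation would use the Space, might look like this:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the polling-container idea: a single-threaded loop takes
// (removes) an event from a collocated store, lets a service process it, and
// writes the result back. A BlockingQueue stands in for the Space/Data Grid.
public class PollingContainer {

    interface OrderService { String process(String order); }

    private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> results = new LinkedBlockingQueue<>();
    private final OrderService service;
    private volatile boolean running = true;

    PollingContainer(OrderService service) {
        this.service = service;
    }

    public void submit(String order) {
        incoming.add(order);
    }

    // The container loop: take -> process -> write back. Because the service
    // only ever sees events handed to it by this loop, it behaves like an
    // actor, without sharing mutable state with other threads.
    public void run() throws InterruptedException {
        while (running) {
            String order = incoming.take();   // blocks until an event arrives
            String result = service.process(order);
            results.put(result);              // write the result back to the store
        }
    }

    public void stop() { running = false; }
}
```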
The need for real-time data processing – a real-life example
In the past few weeks we have been involved with a prospect who is looking to add a social network to their eCommerce site. One of the requirements for this new service was the need to build a graph that includes friends of friends and products in catalogs. The process for building that graph with an In-Memory Data Grid took 2-3 msec, vs. tens of seconds with a traditional approach (note that part of the complexity in this case is that the query itself is ad hoc and can't be easily partitioned). Building that graph on the fly at these speeds just couldn't be done with a traditional SQL database.
The reasons we were able to get to this level of performance were as follows (a rough sketch of the pattern appears after the list):
- We kept the data in-memory.
- A large part of the complex query was pushed to where the data is (you can think of it as a modern alternative to a stored procedure).
- We used partitioning to spread the data and leverage the accumulated memory capacity of those memory instances.
- We used both a scale-out and scale-up model to parallelize the query against all instances and take full advantage of multi-core as well as multi-machine power.
- We reduced the number of network hops by pushing the heavy data manipulation to where the data is and by returning only the accumulated result over the network.
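Here is the rough shape of that pattern, with a hypothetical Partition interface standing in for the data grid's distributed-task API: each partition computes its part of the friends-of-friends query next to its own data, and only the small partial results travel back over the network to be merged.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Scatter-gather sketch: the heavy work runs where the data lives (one task
// per partition), and only small partial results cross the network.
public class FriendGraphQuery {

    interface Partition {
        // Runs locally on the partition: find this user's friends-of-friends
        // among the entries stored on that partition only.
        Set<String> friendsOfFriendsLocally(String userId);
    }

    private final List<Partition> partitions;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    FriendGraphQuery(List<Partition> partitions) {
        this.partitions = partitions;
    }

    public Set<String> friendsOfFriends(String userId) throws Exception {
        List<Callable<Set<String>>> tasks = new ArrayList<>();
        for (Partition p : partitions) {
            tasks.add(() -> p.friendsOfFriendsLocally(userId)); // compute next to the data
        }
        Set<String> merged = new HashSet<>();
        for (Future<Set<String>> f : pool.invokeAll(tasks)) {   // all partitions in parallel
            merged.addAll(f.get());                             // only partial results cross the wire
        }
        return merged;
    }
}
```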
Final words
In summary I would say:
- SQL databases are not going away anytime soon.
- The current "one size fits all" database thinking was and is wrong.
- There is definitely a place for more specialized data management solutions alongside traditional SQL databases.
The adoption of these new solutions will be very much determined by two main factors:
- How well they integrate with the “existing SQL world.”
- How easy it will be to develop for these new alternatives, and how smooth a transition a given solution can offer the average developer.
This is an area of continuous innovation that has been keeping us fairly busy in the past few years, and will probably continue to keep us busy. I’ll leave the details on how we deal with these two challenges to a separate post.