GigaSpaces is Available on Apache Camel

Apache Camel is a Spring based integration framework.
I was happy to see that David Greco released a JavaSpace connector for Camel based on GigaSpaces.

Quoting from David description of the connector:

"The javaspace: component is a transport for working with any JavaSpace compliant implementation, this component has been tested with both the Blitz implementation and the GigaSpace implementation .
This component can be used for sending and receiving any object inheriting from the Jini Entry class, it's also possible to pass an id (Spring Bean) of a template that can be used for reading/taking the entries from the space.
This component can be also used for sending/receiving any serializable object acting as a sort of generic transport. The JavaSpace component contains a special optimization for dealing with the BeanExchange. It can be used, then, for invoking remotely a POJO using as a transport a JavaSpace.
This latter feature can be used for an easy implementation of the master/worker pattern where a POJO provides the business logic for the worker.
Look at the test cases for seeing the various usage option for this component."

Interestingly enough I'm seeing more the use of space based transport used to drive this new type scale-out integration frameworks. Beyond the space transport i believe that Camel users can leverage the fact that they can use the space as a data-store for sharing the state between the various services without needing to go to database just for that purpose.

Nice work David!





July 01, 2008

Twitter's weakening pulse - The scalability "Penalty"

I just came across an interesting article by Dan Ferber: Twitter's Weakening Pulse, which talks about how Twitter is starting to lose its users due to endless scalability and reliability issues:

The Twitter concept has been cloned (Pownce and Plurk), and it won't be long before Facebook, MySpace, or other big players figure out how to make following, followers, tracking, and summizing part of their service

A few weeks ago I published a post that discusses measuring the cost of scaling. Beyond the cost of hardware and software, I also pointed out a cost that is much more difficult to measure, but probably has the biggest impact on the business:

Indirect costs, resulting from the unpredictability of our system and the inflexibility that it imposes on our business, can have huge implications to the business, well beyond any measurable criteria. Even if your company can measure the direct losses from downtime due to lost sales or trades, you'll be hard-pressed to measure damaged reputation, loss of customer trust, and in some cases, the loss of your job...

It would be a false assumption that the company controls the situation, and has time to plan everything in advance. What if Mylie Cyrus shows up on the cover of Seventeen holding the super-cool gizmo the company sells and now all of a sudden there's a mad rush to buy it on the site? We need to assume that we can't predict the load. It doesn't happen when we plan for it. It can happen when we least expect it.

With social networks or electronic trading and e-commerce, such events can quickly lead to a viral effect or what's also known as an 'event storm'. Such a chain of events can quickly lead to disaster, such as a site crash, and the company is now all over the news and the blogosphere (and not in a good way), customers are frustrated and many defect to the competition. The trouble is that they can't fix it in a snap. Remember: they need a few months heads up to re-write the application.

According to a Forrester survey, the cost of one hour of downtime can easily be between $10k to $1 million -- and that doesn't include intangibles such as reputation damage.What i found even more interesting is 25% of companies that participated in that survey couldn't even estimate the cost of downtime.

If we want to be successful we can't afford making scalability an afterthought. The days in which the likes of Amazon, Google and eBay had the luxury of re-writing their applications every few months to meet scalability needs are gone. Customers have many alternatives, and they are not going to patiently wait for a company to go through a lengthy re-write process, but will simply switch to another service.







June 27, 2008

TSSJS Prague: my take-aways

Once again TSSJS was a well-organized event with lots of interesting content. Hot topics that I took  notice of were RIA, new languages, and obviously distributed computing and scalability.

I arrived on Tuesday morning, which gave me a chance to meet John Davies, Ted Neward, Kirk Peperdine and Holly Commins. We found a nice spot not to far from Charles Birdge. At some point we started discussing the reasons we're seeing a burst of new languages. The discussion about languages is thought-provoking. Ted Neward (One of my favorite presenters) seems to be spending a lot of his time recently thinking about this topic. He explained over dinner (while he was completely jet-lagged!) his view. I'll try and summarize the main points:

  1. One size doesn't fit all - we shouldn't try to force one language to do everything and expect it to be good at it all. The concept of using multiple languages in the same application is actually something we've been practicing for a while by using a combination of HTML, CSS, XML, JavaScript, Java, etc. Each language serves a specific purpose.

  2. Different semantics require different expressions, i.e., different languages. An example that was given was Scala and Erlag and the notion of parallel programming as a first class-citizen in the language (as opposed to a set of libraries and explicit APIs in Java). The argument (brought up by Kirk) is that you can't leverage multi-core platforms without languages that were designed to do so.  It reminds me that  indeed multi-threaded programming  wasn't  common until threading became native in the languages. Now you can't think of writing even a simple application without threads. So i think that Ted and Kirk have a valid point.

  3. Usability and productivity - how many lines of code are required to express a certain idea? There are many examples that show how different use cases in Java could be relatively verbose and complex with comparison to the "new" languages.

  4. JVM/CLR makes it easy to introduce new languages as just new views and perspectives running on the same platform. Previously, languages such as Perl and TCL had to be built with an entire stack, typically based on C or C++, and had to be ported to various platforms and operating systems. This approach made the choice of language and language interoperability quite difficult as the decision to choose one languages over the other was considered a "catholic marriage". Today, the JVM in Java and the CLR in .Net enable better separation of concerns. They provide a common platform that can easily support multiple languages. This simplifies interoperability of different languages within the same application. A good example is the new support for dynamic languages in Java 6 and in .Net. This makes the language decision simpler, as the impact of this decision on our project is less drastic and less risky than it was before.

While I think that all of the points are valid I couldn't avoid thinking that we're forgetting past experience. For example, you could easily argue that lines-of-code is only one measurement of productivity. Another measure of productivity is maintenance, i.e., how simple it is to read the code and understand it, transfer it to another programmer, etc. My concern is that if the language becomes too flexible and enables each of us to write our own extension, we're going to find ourselves in a position where the only person that understands the code is the person who wrote it, and even that would hold true only for a certain period of time. Think about C++ templates, macros, operator overloading, multiple inheritance  -- a lot of "nice" features that made our code very flexible, but less readable due to the large number of indirection we had to go through to parse a single class.

One of the things I liked when I switched to Java from C++ was the fact that to understand my colleagues' code all I needed was to read their .java files. In most cases I didn't really need documentation and it was fairly simple to parse the code because Java restricted much of the flexibility that I just mentioned. Trying to do the exact same thing with C++ requires parsing of header files, macros and typedefs. Another issue is that introducing multiple languages can be quite complex and a barrier to productivity, due to limited skill sets within a certain project, even if choosing a language is less risky than before.

I think the concurrency argument is only a temporary one. I'd hate to choose a different language just for that, because it's something that I expect to see native in Java. So far we managed to deal with multi-core and parallel programming quite effectively with Java using event-driven architecture (EDA) and master/worker patterns, and abstract a lot of the concurrent programming with things such as Futures, Remoting, etc. Surely having some of the features of Scala or Erlang as part of Java would have made our life simpler, but if I measure the value vs. risk involved, I'm not sure it justifies using it right now in a real project.

Don't get me wrong. I'm not saying that there is anything wrong with these languages. What I'm arguing is that we need to be very careful before we choose them and make sure that we're measuring the right value, rather than assuming that any of the above arguments applies to our application without proper analysis. Ted was able to convince me to look further into this topic - so I'm probably going to give Scala a try and get a real feel of it.

The event started on Wednesday morning  with a very good presentation by Stephan Janssen. Stephan is the founder of Parleys.com. He is also the founder of the JavaPolis conference held annually in Belgium. He talked about his experience with a wide range of RIA platforms: DHTML, Adobe Flex/Air, JavaFX, Google Web Toolkit (GWT) and Microsoft Silverlight. He discussed his personal experience in using the various technologies as part of Parleys.com.

The combination of a general overview with real-life examples made the discussion quite interesting and lively. The bottom line of this part of the talk was to use Adobe Flex, if you're building a site in the short-term and JavaFX if you're planning on launching your site in about a year's time - due to the maturity cycle and the gaps between the two technologies. Personally I found the fact that there are so many options to do the same thing quite confusing. I wish that we could press the fast-forward button on the maturity cycle of these technologies. Working with previous versions of Parleys.com I must say that i was very impressed with the progress and the *right* use of technologies to build the new version of the site.

Another interesting and quite innovative idea that Stephan presented was about hosting services and collaboration with academic partners. The hosting service will enable companies like ours to host their live presentations in the Parleys.com site. In addition, you can embed the presentation in a blog entry. You can also record your talk online, using a web-based application. The partnership with academic institutions enables scaling not just the content, but also the bandwidth, similar to the way downloads works. IMO Parleys.com could easily become the YouTube of online presentations. If you missed the presentation I'd recommend watching this interview here:

My own presentation, Getting ready for the cloud, seems to have been well-received, although I had some concern that it might be too high-level for some of the audience. You can read some of the comments posted by others here and here.

The presentation included a live demo of a web 2.0 application (displaying market data) running live on Amazon EC2. Although the demo ran over a wireless line, it went surprisingly smooth and I was able to easily redeploy and relocate instances through a simple drag & drop using our UI, which was hosted on one of the EC2 machines. The following day, Uri Cohen gave a session in which he showed the details of what's going on behind the scenes and reviewed the actual code and API used in the demo. If you're interested in experiencing it yourself, you can try out the same demo on our new EC2 version

TSSJS was a good opportunity to meet in person the winners of our OpenSpaces developer competition. I heard interesting stories about what drove them to write their projects. The common theme was the technology challenge - they heard about our technology and scaling pattern and wanted to get a feel for themselves of how it works. You can listen to some of the stories in recent podcasts we published here here and here.

BTW, Jason Carreira, one of the winners, has since worked on another project: a scalable Twitter-like application using GigaSpaces and EC2 (an alpha version already exists, he is now looking for hosting opportunities). And Leonardo Goncalves -- the first prize winner -- is already thinking of the next version of his project. The third winner - Kirill Ishanov -- is also planning to participate in next year's contest. At the end of the first day we showed a video of some of the judges (John Davies and Jullian Browne were missing from the video). It’s a light-hearted video in which the judges also makes fun of Joe Ottinger:)

   
Two of the talks I very much enjoyed were given by John Davies, formerly the founder and CTO of C24 (which was sold to IONA), who has recently started a new venture called Incept5. I've worked with John for many years now and we often have excellent chats about ideas in our respective markets. John's first talk was one I'd heard before, but as always, he updated it with new anecdotes and ideas. He talked about extreme enterprise architectures, specifically ESBs and grid in the low-latency, high-volume, complex envrionments of investment banking. John started by explaining the value of a millisecond to the high-end institutions, literally in terms of dollars, something like $100 million per ms. He went on to talk about compiled languages compared to Java for this sort of processing. It was interesting to see John walk through a very high performance matching and reconciliation engine we had designed together for a client a few years ago and it's exciting to hear that his new company will specialize in this is. John talked about some of the clever coding patterns that had to be implemented to provide linear scalability, and although master-worker was the pattern of choice for scaling, it wasn't as simple as just writing lots of workers.

John's second talk was new to me, and although we discussed the ideas in it in the past, it was fascinating to hear them presented. The room was packed -- standing room only -- as it was a topic near and dear to the hearts of many developers and architects: "The Enterprise without a Database". I thought this would just be an extension of caching, but John went on to emphasize the huge amounts of time and energy (human) being lost on Object-Relational Mapping (ORM). Why do we still persist our well-established Object-Oriented models into a relational database? While ORM is simple at the example level, it doesn't scale given the levels of complexity in today's messaging standards. John made this very clear by example. I got the feeling that he was holding back the solution, perhaps to be released by his new company, but it was clear that there are alternatives to ORM: from caching objects to using CLOBs in classic databases. This is obviously an area to watch, as John always has a good vision for these sort of things.

At the end of the day I had the chance to have a beer with different people in a nice Mexican restaurant in Prague (courtesy of Jodie, the cameraman, and his local friend). After a few beers, mojitos and lots of peanuts (courtesy of John Davis:)), the topic of open source software (OSS) came up. I think that we all agreed that being open is a no-brainer, and that's the way software products should be built. Being open doesn't necessarily mean free - take Jive and Atlassian, for example. They sell commercial products, but they provide customers with the source code

Another model is the dual-license model such as that used by Red Hat and MySQL. It's sometimes referred to as the Fedora model. It means that you have a choice to use a free version but if you do, you're on your own. If you choose to use the supported version, you're going to be charged a subscription, for which you'll get extra features and better packaging/documentation.

I argued that it is important to have a solid business model behind a product/project. It should be as important to the users as to the company developing the product. if a product doesn't have a solid business model two things might happen: the project/product is going to be abandoned at some point due to lack of funding, or the owner of the product will change the licensing model to monetize on the IP and established user-base. We've seen both scenarios happening already.

I also argued that the Fedora model is usually successful only as part of a commoditization strategy. For example, JBoss's strategy was to go after the lower end of WebLogic/WebSphere accounts. The same applies to MySQL. This strategy seems to only work to a certain limit. I argued that this model is not proven in an emerging product category, where large investments in market education and innovation are required to achieve massive and sustainable adoption. In such cases, the Jive/Confluence model seems to be a better fit. Anyway, this topic is worth a separate discussion, so I'll leave it at that for now.

Unfortunately I had to leave on Friday (to be at my daughter's end-of-year party at school), so I missed Shay Banon's presentation. Based on what I heard it went very well.  You can view Shay's presentation here.

Anyway, it was a real fun event and i look forward to next year.

June 24, 2008

InfoQ Article - RAM is the new disk...

Steven Robbins published an interesting article on InfoQ titled  RAM is the new disk.
In the comment thread, Steven Robbins quoted Tim Bray and others comparing file system performance to memory:

Memory is the new disk! With disk speeds growing very slowly and memory chip capacities growing exponentially, in-memory software architectures offer the prospect of orders-of-magnitude improvements in the performance of all kinds of data-intensive applications. Small (1U, 2U) rack-mounted servers with a terabyte or more or memory will be available soon, and will change how we think about the balance between memory and disk in server architectures....

It raises the following questions:

What if the disk were RAM-based? Does that mean that all we need to do is replace the current disks with RAM technology to gain speed? The title of the article leads people to think along those lines.

My own take:

It's not just the speed of memory compared to disks that makes a difference. It's not even the extra benefit of the collocation of CPU and memory. What's really a important is the fact that disk is a sequential storage medium that was designed primarily to store a stream of bytes, not tables of data. That means that if you want to store data objects you need to serialize them into bytes, map sectors in the file system that points to the location of those bytes. Maintaining an index to this data is a relatively expensive operation as every additional index is stored as a copy of the original data, there is no real option to access data by reference, etc. If you think about it, existing RDBMS are basically a mapping layer between data-tables representations and sequential storage. A large part of existing database implementations is spent on addressing the impedance mismatch between the two representations models. All this complexity doesn't really exist when we're dealing with memory. That means that if will take existing databases and run them on memory based devices we're basically going to force the limitations of sequential storage representations into memory.

To exploit the real value of memory based resources we need to have different approach and implementations that assume that data can be accessed by reference - that objects can be accessed directly from our application without complex mapping layer in our native application domain.

At this point I'd like to end with Tim's last remark:

Disk will become the new tape, and will be used in the same way, as a sequential storage medium (streaming from disk is reasonably fast) rather than as a random-access medium (very slow). Tons of opportunities there to develop new products that can offer 10x-100x performance improvements over the existing ones.

June 16, 2008

Meet us @ TSSJS Prague

TheServerSide has a good record of picking nice spots for their conferences, and this year's Java Symposium in Prague is no exception.

It's looking to be a fun event, as I'm going to meet not just lots of old friends, but also the winners of our first OpenSpaces developer contest. I've already written about the contest and some of the submissions in a previous post. If you haven't already, check it out ,as we decided to continue with the contest next year and use  TSSJS as an opportunity for attendees to apply for "early bird scholarships" worth $1,000 each -- all this at an award ceremony we're holding in Prague during the show. Besides free booze and food at this event, we're going to show a nice video featuring the judges of the competition. Those who worked with Alit through the current competition will probably be happy to know that she is going lead this part of the show together with John Davis who was one of the Judges.

I'm going to be there with some of my colleagues from GigaSpaces, namely Shay Banon and Uri Cohen.
My presentation is titled Getting ready for the cloud but it really talks about the next wave in distributed computing in which clouds plays an important role and have the potential of changing many of the things we used to do in the past.

Banon will be talking about some of the work that he's been doing with Mule, Lucerne/Compass and Spring in his session Beyond Data Grids. I've seen him discussing some of these topics in Las Vegas this year, so I know it's going to be really interesting. Last time it sparked many questions about how clustering technologies can deal with scaling challenges, how in-memory data grids can replace or co-exist with traditional databases, and how they can be applied to different frameworks given real life examples.

Uri is going to talk about his experience in building scalable web 2.0 applications using Ajax, Tomcat and Spring MVC, and running on the Amazon EC2 cloud. He will discuss specific patterns for dealing with Ajax scaling issues, and also provide patterns and tips for moving from a tier-based to a scale-out model based on recent work he's done with JBoss and, of course, GigaSpaces.

The TSS event is also going to be a good opportunity for us to expose some of our latest development in our upcoming 6.5 release,such as the new Service Virtualization Framework (based on Spring Remoting), Dynamic language support, extended support for hibernate and enhanced database integration, built-in Maven support, support for Spring 2.5 annotations and enhanced administrative and real-time monitoring.

Mule users will also benefit from our extensive support for the Mule ESB. We're also going to show some of the latest developments with EC2 and cloud computing environments. Even though TSS events tends to be Java-centric, I believe that Java users will be happy to learn about our interoperability among Java, C++ and .Net. For those unfamiliar with it, I would recommend giving it a closer look as it provides high performance and an extremely simple alternative for making the language barrier pretty much obsolete.

There is much more to it than I can cover in this post. In fact, we realized that an entire post will not be enough to cover all the relevant content of our 6.5 release, so expect to see several dedicated posts in the coming weeks -- here and on the GigaSpaces Blog -- covering different aspects of new features, including some "behind-the-scenes" stories. Stay tuned!

June 05, 2008

Economies of Non-Scale

Scalability forces us to think differently. What worked on a small scale doesn't always work on a large scale  -- and costs are no different.

To measure the cost impact of scaling, let's look at the amount of resources required to scale to a given level. We'll use Amdahl's Law as a method to measure the amount of required CPUs. This will provide us a proxy for hardware and software costs. Later on we'll also review other costs related to the process of scaling. What's going to come up clearly in this analysis is that the cost of scalability grows exponentially in non-linearly scalable applications.

I'll also argue that scaling is not just a technical issue: it has a direct impact on our business and its ability to effectively compete.

Measuring the Impact of Scalability on Hardware and Software Costs

The following analysis is based on Amdahl's Law, which is described well in this Wikipedia entry. Put briefly, the negative impact of the non-scalable portion of our application grows exponentially relative to the scaling requirement, up to the point in which adding resources will not improve performance/throughput, as seen in this diagram:

 

Amdhal_law_2

By non-scalable I mean the part that is serial (non-parallelizable) -- or to use  terminology that is more relevant: the level of contention in the application. Contention can be thought of as the percentage of time operations wait on things such as shared table-locks in the database, persistent queues or distributed transactions.

The diagram above shows that if 90% of our application is free of contention, and only 10% is spent on a shared resources, we will need to grow our compute resources by a factor of 100 to scale by a factor of 10! Another important thing to note is that 10x, in this case, is the limit of our ability to scale, even if more resources are added.

Now let's see what it will cost us -- in terms of CPUs and software licenses -- to increase our scalability by a factor of 10, assuming that we have only 10% contention (and that's a fairly optimistic scenario with prevalent tier-based architectures).

 

Costofscaling_4

Take-Aways

There are two key take-aways from this analysis:   

  1. The cost of non-linearly scalable applications grows exponentially with the demand for more scale.
  2. Non-linearly scalable applications have an absolute limit of scalability. According to Amdhal's Law, with 10% contention, the maximum scaling limit is 10. With 40% contention, our maximum scaling limit is 2.5 - no matter how many hardware resources we will throw at the problem

An Example

The following is inspired by numerous true stories I have seen at GigaSpaces customers.

A team is tasked with building an online order management application that needs to process 1,000 orders per second. They choose a typical n-tier architecture with a web server as the front-end and a database as the data-tier. Note: In this case I give the web-tier as the front-end. In some systems the front-end is actually a messaging system. While the implementation is rather different, the behavior for the purpose of this discussion is the same, as both represent different forms of feeds or service requests.

Let's assume that the team designed the architecture by the book to ensure 100% reliability and consistency. This means that every critical transaction is stored in the database.

They now found that a single web server can handle only 200 transaction/sec, so they decide to put a load-balancer on the front-end, and deploy 5 web servers to meet the 1000 orders/sec requirement. At this point they realize that despite the fact that they increased the amount of servers by 5, the total number of orders/sec didn't really increase by much -- and only reached about 400 orders/second.

They start to monitor the system, and find that the application spent 40% of its time on shared locks in the database. As we already saw above, with 40% contention, we can only increase the throughput by a factor of 2.5 -- or 500 orders/second, so no matter how many web servers are added, the application will never be able to meet its throughput goal.

So the team decides to reduce the contention by placing a distributed cache in front of the database, which will reduce the hits on the database. They are cost-conscious so they select a free product -- memcached -- as the caching solution. As memcached cannot serve as the system of record (it doesn't support transactions, queries, is not highly-available, etc.), adding it reduces the contention,  but does not eliminate it.

For this example, let's take an extremely optimistic scenario and assume that memcached reduces the contention from 40% to 10%. According to Amdhal's Law, increasing throughput by a factor of 5 with 10% contention, will require 10 servers. Now the team is happy! 

Just as they get the system to work as expected, the boss knocks on the door and says: "Hey,  we're going to launch a major campaign for the holiday season, and marketing anticipates double the number of visitors on our site. If we're very successful, we might even triple the normal traffic. Can we support this?".

The team is in initial shock. "Double? Triple? Are you kidding me!? We just worked our ass off to get this application to work for the existing load, and now he's talking about doubling and tripling the load as if it were trivial?" The team now faces the prospect of explaining to management that they don't know how to achieve this capacity, or even worse, that they are incapable of it (while their competitor has already achieved it last year). Let's see: double throughput means 2,000 orders/sec. So they will need to increase single server throughput by a factor of 10.  According to Amdhal's Law they will need grow from 10 servers to 100 servers! Tripling the capacity is not even an option. The system already reached its maximum limit when it grew by a factor of 10.

And there's one more thing the team needs to tell the boss...

Ahem, regarding the budget...

For the sake of discussion, let's assume that the team was using free products to reduce software license costs. To meet their initial goal of 1,000 orders/second they had to use 10 machines for web servers and another one for the database. If each machine costs $10,000, total costs are $100k for the web-tier and another $10k for the data-tier, so we're talking $110k total cost of hardware.

In addition, the team spent a substantial amount of time on iterations to analyze the bottleneck and find a solution, wire the pieces together, make sure everything works and modify code. Let's say this development effort took 5 team members 3 months to complete, so it cost about $150k (again, I'm being optimistic). Total costs are at $260K.

As we've already seen, to double capacity, they will need 100 servers. Let's also assume that it will require a similar effort to develop, tune and test it  -- so $900k in additional hardware and another $150K for development. Total costs are now at $1.31 million.

Now off to triple the load... wait, we can't. The team now literally has to go back to the drawing board (a whiteboard in this case) and completely re-design and re-write the application -- if they can even figure out how to do it. Can we even measure the cost of such a scenario? We can probably measure the development cost, but what is much harder to measure is the amount of money the company is going to lose because it can't support the tripling of load.

It would be a false assumption that the company controls the situation, and has time to plan everything in advance. What if Mylie Cyrus shows up on the cover of Seventeen holding the super-cool gizmo the company sells and now all of a sudden there's a mad rush to buy it on the site? We need to assume that we can't predict the load. It doesn't happen when we plan for it. It can happen when we least expect it.

With social networks or electronic trading and e-commerce, such events can quickly lead to a viral effect or what's also known as an 'event storm'. Such a chain of events can quickly lead to disaster, such as a site crash, and the company is now all over the news and the blogosphere (and not in a good way), customers are frustrated and many defect to the competition. The trouble is that they can't fix it in a snap. Remember: they need a few months heads up to re-write the application.

Lessons Learned

There are a few things we can learn from this story:

  • direct costs of scaling grow exponentially with the demand for scale;
  • the cost of software licenses can be a very small factor in total cost;
  • indirect costs, resulting from the unpredictability of our system and the inflexibility that it imposes on our business, can have huge implications to the business, well beyond any measurable criteria. Even if your company can measure the direct losses from downtime due to lost sales or trades, you'll be hard-pressed to measure damaged reputation, loss of customer trust, and in some cases, the loss of your job!

A Better Approach: High-Throughput and Linear Scalability

Things don't have to be the way I described above -- and for a growing number of companies -- they aren't. In the blog post Twitter as a scalability case study, I detailed the principles that companies such as Google and Amazon follow, as well as those who use GigaSpaces, so I won't repeat them here.

Following those principles, we use a memory-centric solution that addresses architecture holistically (end-to-end) and is linearly scalable. Because it is in memory, it performs much better. GigaSpaces customers have seen throughput improvements of between 10x and 100x, depending on the existing architecture and the scenario. Let's say -- again, being very conservative -- that throughput with this approach is only 5x: 1000 orders/sec per server. This would mean that to meet the goal of 1,000 orders/sec we only need 1 machine (compared to 5). Even if the software cost of this solution is $20k/CPU, thanks to the reduction in hardware costs, we will end up with significant cost savings.

Perhaps more importantly, this approach follows a shared-nothing architecture (meaning each node is entirely self-sufficient) and is linearly scalable. As such, scaling simply requires adding more servers as needed, without code changes or complex provisioning. Moreover, if we know the throughput of a single server, we know exactly how many servers we will need to achieve future or unanticipated requirements. All we need to do is multiply the number of servers by the throughput per server. Remember, because it is linearly scalable, it does not suffer from diminishing returns, and there is no 'scalability wall'. So if one machine can process 1k orders/sec, two machines can process 2k orders/sec and N machines can process Nk orders/sec.

In the case of our team above, if they had taken this approach, they would have only needed two servers to double the throughput (with a total solution cost of $220k - compared to $1.31 million! ) and three servers to triple it (with a total solution cost of $250K - compared to who knows what). And if they had faced any unanticipated peaks (thanks to Mylie), they could have quickly scaled for that as well.

What's interesting about linear scalability, is that even if we assume no throughput increase at all PER SERVER, the cost savings are still remarkable. In our case, let's assume that the linearly scalable solution still has a throughout of 200 orders/sec per single server.   In order to achieve the 2,000 orders/sec throughput, we will just need 10 servers, compared to 100.

The example above illustrates that what initially appeared to be a low-cost solution (free middleware), became extremely expensive as scalability requirements grew.

Measuring the Cost of Scalability: A Cheat Sheet

Following is a short summary of what we should consider when measuring the cost of system scalability (or lack thereof):

  • Cost of hardware and software, as a function of:
  • How many CPUs (or machines) are required to achieve the desired throughput/concurrent users/latency, given the:
    • Throughput per single server
    • Level of contention, and therefore, the required number of servers needed to scale as prescribed by Amdahl's Law. This calculation needs to be performed given different scale requirements (2x, 3x, 5x, 10x, etc.), as it will grow exponentially
    • If we cannot achieve on-demand scalability -- we also need to consider the cost of hardware and software required for over-provisioning to ensure we can handle peak loads
  • Cost of development, QA and testing
    • Initial design and development cost, including learning curve and integration
    • Cost of re-designs and re-writes for when we need to scale our application
  • Cost of failure – what is the cost of downtime due to under-provisioning or inability to scale on-demand. This should consider direct revenues loss, compensation payments, loss of productivity, damaged reputation and future income and so on.
  • Provisioning, deployment and operations costs - including management and monitoring. The more complex the system is, the more difficult it is to identify bottlenecks and root causes.

A Final Note on Comparing Solutions:

In the context of scalability and ROI, when we evaluate competing solutions, we need to make sure that we are comparing apples to apples.

  1. Comparing Apples-to-Apples: It is not enough to measure the license cost. We need to normalize it with the performance and scale factors. For example, if two products cost the same in terms of cost/CPU, but one performs 5 times better, then the cost of that product is, in fact, one fifth of the other
    .
  2. Total Cost of Ownership (TCO): We cannot look at the software license costs alone. We need to assess the overall cost of the system, including hardware (and other platform costs, such as OS), additional components required (such as load balancers and storage), cost of development and cost of failure. In the final analysis, free products that are not linearly scalable are going to cost much more than commercial products that are.
       
  3. End-to-End Measurement: When it comes to scalability, you are only as strong as your weakest link, so assessing the cost of scaling requires a holistic measurement. Before we compare two products we need to understand how they each play a role in achieving end-to-end scalability. Linear scalability requires an end-to-end solution. Solutions that are built from a bundle of tiers and products are likely to be non-linearly scalable, as contention is created by the integration of the tiers and products, the need to ensure consistency, different clustering models and other issues. This means that before we can even measure or compare cost, we first need to compare what it takes to reach linear scalability with each product. It might be that on a simple caching or messaging level, two products provide the same level of scalability. When, however, we need to integrate the messaging system, use a transaction manager, a database or filesystem to ensure reliability, our  end-to-end scalability is going to be significantly limited.

This post was co-authored with Geva Perry.

May 18, 2008

Twitter as a scalability case study

So once again Twitter was down last week for a good chunk of time.

A lot has been said already about Twitter's scalability issues. Many have given Twitter as an anti-pattern of how not to deal with scalability and have suggested different solutions for scaling it. As Twitter is famously a Ruby-on-Rails deployment, this case has also been used as a weapon in the language/platform wars between the RoR and Java camps, and to a lesser degree, also with the LAMP (PHP) camp.

I tried to educate myself on the Twitter architecture, and came across this excellent summary by Todd Hoff on highscalability.com (BTW, Todd is doing an excellent job of collecting all this valuable information in a very professional way. Todd, keep up the good work!)

Reading through this summary, a pattern emerges. Twitter is no different than many other web apps that have become overnight successes.

They had a good idea and went to implement it as quickly as possible. Ruby seemed to be a perfect tool for them to get there quickly. Success was much bigger and faster then they imagined. Not surprisingly. the architecture was not designed for scalability and they are now forced to go through the painful process of scaling their architecture. There are similar stories about eBay, LinkedIn, MySpace and many other notable web companies. 

What's somewhat surprising is that they had to hit almost every possible bottleneck before realizing that they still have a scaling issue. For example, they used memcached to address the scalability of their (single!) MySQL database. They tried various messaging solutions (including a Linda-based implementation), and threw a lot of hardware at the problem. Now it seems they are looking to completely re-write the application, as noted here.

Recently I've encountered similar scenarios with GigaSpaces customers: one is a PHP-based app for "casual gaming" and another is a gambling application designed with a classic J2EE architecture. Both were facing similar scaling bottlenecks. The fact that we're seeing the same scalability issues in PHP, Java and obviously Ruby, tells us that the scalability problem is not about the language. It's about the architecture.

I hate to repeat myself with this cliché, but with scalability your only as strong as your weakest link. Trying to bypass this problem by plugging in point solutions is not going to cut it. It is therefore not surprising that those who dealt with scalability challenges successfully -- such as eBay, Amazon, Google and Yahoo -- went through architecture changes and eventually reached a similar pattern to the one I described in my summary of the Qcon conference: Architecture You Always Wondered About: Lessons Learned at Qcon

Scalability -- Doing It Right

  • Asynchronous event-driven design: Avoid as much as possible any synchronous interaction with the data or business logic tier. Instead, use an event-driven approach and workflow
  • Partitioning/Shards: Design the data model to fit the partitioning model
  • Parallel execution: Use parallel execution to get the most out of available resources. A good place to use parallel execution is the processing of users requests. Multiple instances of each service can take the requests from the messaging system and execute them in parallel. Another good place for   parallel processing is using MapReduce for performing aggregated requests on partitioned data
  • Replication (read-mostly): In read-mostly scenarios (LinkedIN seems to fall into this category well), database replication can help load-balance the read load by splitting the read requests among the replicated database nodes
  • Consistency without distributed transactions: This was one of the hot topics of the conference,   which also sparked some discussion during one of the panels I participated in. An argument was made that to reach scalability you had to sacrifice consistency and handle consistency in your applications using things such as optimistic locking and asynchronous error-handling. It also assumes that you will need to handle idempotency in your code. My argument was that while this pattern addresses scalability, it creates complexity and is therefore error-prone. During another panel, Dan Pritchett argued that there are ways to avoid this level of complexity and still achieve      the same goal, as I outlined in this blog post.
  • Move the database to the background: There was violent agreement that the database      bottleneck can only be solved if database interactions happen in the background. (NOTE: I recently wrote a more detailed post explaining how you can effectively move the database to the background.

This brings me to an interesting story about a startup company that has an ad engine using PHP. Like many others, when they became successful, they faced scalability issues and faced a similar dilemma to to the one faced by Twitter. Their story, written by the implementor, a consulting company called Rocketier, is a good example of how to do it right (in this case, using GigaSpaces):

As part of a project for a Web 2.0 company, which focuses on the Affiliate Junction market, Rocketier developed a generic connector from PHP to GigaSpaces XAP. The company's product serves as a hub which connects advertisers and publishers in a unique proprietary model, tracking and billing ad views, ad clicks and resulting sales in affiliate networks. The software was originally developed using PHP, however, once the number of customers started to grow, the product hit severe performance and scalability issues. The main bottleneck in the product was the ad view counting mechanism which effectively limited the software to an upper bound of 1M hits per day. Instead of replacing the entire product, Rocketier focused on the performance critical business processes and developed a backend GigaSpaces module, responsible for counting the ad views and clicks. This module was integrated into the software using a custom designed PHP connector, employing COM+ technology. The new solution is easily scaled up and out and can support the company's growing customer base up to 200M hits per day...

...This solution enabled the company to meet its business needs in a short period of time, while keeping a low risk factor due to the fact that only the performance critical business processes were replaced. The chosen technological solution added grid support to the PHP-based product, including: PHP and GS mediation (the aforementioned connector), application persistence, session based continuity, asynchronous programming capabilities and presentation and business logic separation.

To sum up, the company was able to leverage its existing investment and migrate evolutionarily from an initial prototype to a heavy load, production level solution (from Beta to a Post Digg Effect system), in a short time to market and with minimal risk"

 
Rocketier was kind enough to publish part of their work on
OpenSpaces.org.

What's interesting about their approach is the fact that they were able to address PHP scalability without throwing out the existing implementation. They used PHP as it was meant to be used: a front-end. They grid-enabled the processing engine using GigaSpaces. This approach is a good one for scalability of any web application whether it's Java, PHP, RoR, .Net, or J2EE.

Are we all doomed to go through this painful process when we are successful?

We seem to take it for granted that dealing with scalability is complex. When we start a new application it's hard to know whether we're ever going to be successful to the point where the investment in scalability is worth the effort. At this initial stage the important thing is time-to-market. We want to get our idea out there as quickly as possible. This is a reasonable desire as indeed most projects don't take off. 

Now imagine what would happen if dealing with scalability wasn't that difficult. That would have change the entire decision making process, and would enable Twitter and many others to start with a scalable architecture from day one, avoiding this painful process.

So the question is what would is required to simplify building a scalable application to the point in which it is as simple as building it for a single machine?

From my experience, most challenges have already been faced by and dealt with others - so the first thing that I did was look at how others (not necessarily in the same industry) addressed this issue.

In this case, storage virtualization is a good example. At first, we used local disks. Local disks tend to get filled-up quite fast. It was very hard to deal with this problem as it required replacing the disk with a bigger one every time full capacity was reached. IT had to go through this process for every user and every application -- very painful and costly. The solution came in the form of NAS and SAN, or network-attached storage. Instead of using a local disk, use a virtual disk that resides somewhere on the network. The user and the application don't need to be aware of it, because they use a local disk driver that virtualizes the network devices to make them look as if they were just another local disk. The application scales but hasn't changed as a result. We can add and remove devices as we wish with no changes to the application. Later on, if there is a more cost effective solution available, we can easily replace the devices.

We can apply the same concept of virtualization to the middleware stack -- namely the data, the messaging and the processing -- with the same degree of simplicity. The application interacts with a "proxy" that hides the details of how a message or update operation is routed, how fail-over is handled, how data is partitioned and so on.

With services such as Amazon EC2, and other cloud environments, this can be made even simpler, as we can have a pre-configured image and hardware ready for deployment. All we need to do is just deploy our business logic. (See an EC2 example here).

With today's frameworks architectures, we don't have to go through the same painful experience. We can build scalable architectures from the get-go. I would even argue that it takes less time to build applications with this approach than the traditional client-server approach.

In my next post I'll discuss the difference between TCO and software license costs in scalable applications.

April 30, 2008

Cool Projects on OpenSpaces.org

The OpenSpaces.org community site launched in January. I was surprise by the rapid adoption of OpenSpaces since then, with lots of interesting innovations on things I didn't even think of. I'm sure that some of the projects will be very useful to many OpenSpaces users. This shows the value behind  an ecosystem and community. Given the right tools, people will start collaborating and share things that otherwise would be buried in their hard disk, or in their mind.

The OpenSpaces.org site also provides a great tool for GigaSpaces Partners and individuals in the general developer community to expose their skills by publishing valuable content. A good example is GridDynamics, a GigaSpaces partner, who invested time and effort on producing high quality, well-documented projects.

The same goes for various people on the GigaSpaces team who came up with great ideas based on work that they did with customers. They use the OpenSpaces,org platform to share the tools they developed with other users in the community who might have similar needs. For example, the OpenSpaces demos project shows how to integrate Ajax, Spring MVC and OpenSpaces to scale a typical web application (market data front end, in this specific case). 

Another good example is TGris, an extension of the testing grid framework that we use internally at GigaSpaces, and which several customers showed interest in for automating the testing of their own applications (note that the tool is not specific to OpenSpaces).

Another class of  interesting projects are those that integrate OpenSpaces with various frameworks and APIs. These projects simplify the integration and adoption process, and shorten time-to-value. Good examples are the projects that provide integration for OpenSpaces/GigaSpaces with Amazon SimpleDB, JPA, and Memcached , as well as the  Cache Integration project, which enables OpenSpaces/GigaSpaces support for many frameworks, such as Acegi Security, Cocoon, Jetty, iBatis, OpenJPA, Velocity and others.

Other people built entire functional applications,  such as Leonardo Gocalves's  GoDo - Goods Donation System (see details below), and Jim Liddle's MobileGSFeed, which provides a scalable solution for handling Atom feeds through the iPhone. Jim actually runs our Sales in the UK & Ireland. Never in my dreams did I imagine that OpenSpaces.org would be used by sales guys :-)

Anyway, I'm very pleased to let you know that we reached an important milestone for OpenSpaces two weeks ago when we reached the deadline of the developer contest. Fourteen candidates made it to the final stages. Only three will be finalists. A distinguished panel of judges interviewed each contestant. The judges are Adrian Colyer, CTO, SpringSource; Joe Ottinger, Editor, TheServerSide.com; John Davies; Julian Brown, Architecture Consultant, RWE;  Keerat Sharma, Platform Engineer, Gallup; and Ross Mason, Co-founder and CTO, MuleSource.

All of the candidates put up a real good fight and made it very hard for the judges to reach their final decision. The winners of the contest will be announced in a nice venue in Prague during TheServerSide Java Symposium event. Stay tuned for updates on the exact date and venue here and on The GigaSpaces Blog and web site. We also intend to publish interviews with each of the finalist project owners and post them in a blog.

Here are some of the interesting projects (in alphabetical order). The full list of projects can be found here.

Please join one of the projects or start a new one yourself. If you already developed something, but are concerned about the time it will take to initiate a new project -- don't be! It is extremely easy and quick to start a new project and if you need any help, we're ready to support you.

 

 

 

 

April 29, 2008

Nice article on Space-Based Architecture

I Just came across a very good article by Clara Ko summarizing the concept behind Space-Based Architecture on Java Pulse. The article is based on our recent white paper Scaling Spring Application In 4 Steps.  Clara's article is available here. I highly recommend reading it.

I particularly liked the way Clara summarized the problem with tier-based architectures:

GigaSpaces argues that tier-based architecture is the nemesis to linear scalability. In fact, the architecture itself is the bottleneck, not unoptimized component. This is because in a tier-based architecture, as scability patches are introduced, the communication between the tiers become more complicated and expensive. The complex interaction between tiers and between physical machines in the deployment cause problems with latency and data consistency. Possible problems with tier-based architecture are bottlenecked access to the database, bottlenecked access to a centralized messaging provider, messaging overhead, network latency, inefficient clustering, and unreliable failover recovery. GigaSpaces addresses each of those problems by providing a platform that enables space-based architecture.

Couldn't say it any better.

April 22, 2008

GigaSpaces Dashboard Using Hyperic plugin

 Alexey Kharlamov created  a project that implements a Hyperic plugin for GigaSpaces. It is available as an open source project on OpenSpaces.org.

Hyperic is open source software for web infrastructure monitoring and management. Among its customers are Hi5 and Contegix. It supports Apache, JBoss, Tomcat, Spring, Weblogic and many other products.

I know of many of GigaSpaces users and customers who expressed interest in having such a customizable monitoring tool.

Now is your chance to join this project, use it, extend it and optimize it.

You can find details of this work here:

http://support.hyperic.com/confluence/display/hypcomm/Gigaspaces

Hgmserver_view_cr

This project also shows the value of OpenSpaces.org as a tool for extending the ecosystem around  the OpenSpaces development framework and GigaSpaces XAP.

 Alexey, very impressive work!

 

April 20, 2008

Google App Engine - what about existing applications?

Recently, Google announced Google App Engine, another announcement in the rapidly growing world of cloud computing. The announcement led to an interesting discussion on InfoQ, which included topics such as which applications are appropriate for the cloud and issues of vendor lock-in:

Floyd Marinescu gives his perspective on how this will impact that next generation of application platforms:

I think Google App Engine is the beginning of a whole new category of cloud computing offerings, making the total set in my view loosely similar to the following:

  1. Grid or Master/Worker implementation clusters. This is the traditional view of cloud computing, where you just have a master/work type programming model with the workers being executed transparently across a grid of computers. This type of stuff is happening behind the firewall, I don't know of any internet/publically exposed services that do this.
  2. The internet becoming a new middleware platform. This is where Amazon, Microsoft, Yahoo, and are playing. Middleware api's/products that previously were installed in the datacenter and charged on a license fee basis are now moving online with web service accessesible API's, charged on a utility computing model. Things like compute (Amazon EC2), Storage (S3, Microsoft SQL Data Services), Queing, domain-specific data sources (the realm of mashups and RSS), are part of this 'trend'. The internet is starting to provide middleware building blocks that are mashed up to build applications. Users can see massive cost savings compared to buying similar tools and maintaining them themselves in the data center.
  3. The internet as an application hosting platform.So whereas the previous trend is about exposing middleware constructs on the net, this trend is about exposing a whole end-to-end application stack with API's for everything from MVC/web to messaging to data storage (entities) built in. As a developer now you don't even need to think about things such as scalability, about data storage format, etc. You just deploy your app to this 'platform cloud' and it just works. Google's cloud computing offering is thus a higher-up-the stack offering compared to Amazon.

Galen Gruman and Eric Knorr give a good review of cloud computing in a recent Infoworld article, entitled What cloud computing really means, in which they categorize this type of offering as Platform as a Service:  

Platform as a Service: Another SaaS variation, this form of cloud computing delivers development environments as a service. You build your own applications that run on the provider's infrastructure and are delivered to your users via the Internet from the provider's servers. Like Legos, these services are constrained by the vendor's design and capabilities, so you don't get complete freedom, but you do get predictability and pre-integration. Prime examples include Salesforce.com's Force.com and Coghead. For extremely lightweight development, cloud-based mashup platforms abound, such as Yahoo Pipes or Dapper.net.

What's interesting about all this is the gap that still exists between the Platform as a Service delivered by cloud providers and enterprise application developers. It is clear that neither Google nor Amazon thought deeply about what it means to run existing applications on the cloud, especially those already written with JEE or .Net. What's also interesting is the assumptions they are making with regards to the trade-offs among consistency, performance and reliability. For example, if you look at Amazon's SimpleDB service, the Amazon message queue and other web services, it is clear that they were designed to suit a certain type of application, designed in a very specific way..

In addition, and looking at Google's offering especially, it seems that they came up with their own way of doing things and are essentially dictating that way to anyone who wants to use their platform. As Jason Carriera commented in about Google (in comparison to Amazon):

I can't understand who'd want this... With Amazon Web Services, you get a VM hosting service where you can run any AMI

So while Amazon is still more open than Google, if you want to take full advantage of their Platfrom as a Service offerings, you are pretty much in the same boat.

This brings up some very serious questions:

  • If we want to take advantage of one of the clouds, are we doomed to be locked-in for life?
  • Must we re-write our existing applications to use the cloud?
  • Do we need to learn a brand new technology or language for the cloud?

Geoffrey Wiseman's comment on the InfoQ thread highlights some of the common concerns:

..that scares me precisely because of the vendor-lock-in. I mean, I wouldn't necessarily be shocked to have an ISP reserve that kind of right -- only I would retain the option of hosting it somewhere else. With GAE [Google App Engine], if Google decides they want to take down your app, what's your next option?...

Garett Rogers published his view on ZDnet The problem with Google App Engine, in which he outlined similar risks to those described by Geoffrey:

  • You are putting your application in Google’s hands

Think about that for a minute. You are at the mercy of Google — if disaster strikes and Google one day disappears, you are done to. Or, more realistically, if the Google App Engine goes down for an hour, you are also down for an hour — and you will have no idea what happened.

  • It’s free right?

Not only are you locked in, you are completely at the mercy of Google’s future pricing strategy for the Google App Engine…

  • Privacy should not be taken lightly

     Google has a very strong privacy policy — and personally I trust them. However, I’m trusting them with my personal information — you will be trusting them with all of your company’s data?...

  • Once you are in, you are really in

      Using Google’s infrastructure is very tempting. But any smart company should have some sort of plan for the future…

It is clear that cloud computing is still in its early stages and it is, therefore, hard to tell how it's going to shape up. It's also very likely that we will see increasing competition among the cloud hosting providers which means that the last thing we want to do is lock ourselves to a specific provider's platform.

So how do we go about that?

The most important approach I would recommend is abstraction: write your code in a way that is agnostic to the environment in which it will be running. This may sound trivial, but it's not that simple. Following are  some practical suggestions for a practical way of doing this:

  • Use development frameworks, such as Spring, Hibernate and Mule, to abstract your application from the underlying runtime implementation. Interestingly, now that Java has built-in support for scripting  languages you can also write pieces of your code in any of those languages, even keeping yourself abstracted from the use of Java.
  • Use the runtime platforms of your choice to deliver messaging, parallel processing and data services based on cost/value. At the end of the day, it's the runtime platform's responsibility to deliver its services in a virtualized/cloud environment and abstract the complexity involved from the application using it. 
  • Use the cloud of your choice. A cloud could be anything from your local PC to your corporate data center to a public Internet cloud such as Amazon EC2. 

After applying all of these principles, here's how your new application stack will look like: