« May 2008 | Main | July 2008 »

June 2008

June 27, 2008

TSSJS Prague: my take-aways

Once again TSSJS was a well-organized event with lots of interesting content. Hot topics that I took  notice of were RIA, new languages, and obviously distributed computing and scalability.

I arrived on Tuesday morning, which gave me a chance to meet John Davies, Ted Neward, Kirk Peperdine and Holly Commins. We found a nice spot not to far from Charles Birdge. At some point we started discussing the reasons we're seeing a burst of new languages. The discussion about languages is thought-provoking. Ted Neward (One of my favorite presenters) seems to be spending a lot of his time recently thinking about this topic. He explained over dinner (while he was completely jet-lagged!) his view. I'll try and summarize the main points:

  1. One size doesn't fit all - we shouldn't try to force one language to do everything and expect it to be good at it all. The concept of using multiple languages in the same application is actually something we've been practicing for a while by using a combination of HTML, CSS, XML, JavaScript, Java, etc. Each language serves a specific purpose.

  2. Different semantics require different expressions, i.e., different languages. An example that was given was Scala and Erlag and the notion of parallel programming as a first class-citizen in the language (as opposed to a set of libraries and explicit APIs in Java). The argument (brought up by Kirk) is that you can't leverage multi-core platforms without languages that were designed to do so.  It reminds me that  indeed multi-threaded programming  wasn't  common until threading became native in the languages. Now you can't think of writing even a simple application without threads. So i think that Ted and Kirk have a valid point.

  3. Usability and productivity - how many lines of code are required to express a certain idea? There are many examples that show how different use cases in Java could be relatively verbose and complex with comparison to the "new" languages.

  4. JVM/CLR makes it easy to introduce new languages as just new views and perspectives running on the same platform. Previously, languages such as Perl and TCL had to be built with an entire stack, typically based on C or C++, and had to be ported to various platforms and operating systems. This approach made the choice of language and language interoperability quite difficult as the decision to choose one languages over the other was considered a "catholic marriage". Today, the JVM in Java and the CLR in .Net enable better separation of concerns. They provide a common platform that can easily support multiple languages. This simplifies interoperability of different languages within the same application. A good example is the new support for dynamic languages in Java 6 and in .Net. This makes the language decision simpler, as the impact of this decision on our project is less drastic and less risky than it was before.

While I think that all of the points are valid I couldn't avoid thinking that we're forgetting past experience. For example, you could easily argue that lines-of-code is only one measurement of productivity. Another measure of productivity is maintenance, i.e., how simple it is to read the code and understand it, transfer it to another programmer, etc. My concern is that if the language becomes too flexible and enables each of us to write our own extension, we're going to find ourselves in a position where the only person that understands the code is the person who wrote it, and even that would hold true only for a certain period of time. Think about C++ templates, macros, operator overloading, multiple inheritance  -- a lot of "nice" features that made our code very flexible, but less readable due to the large number of indirection we had to go through to parse a single class.

One of the things I liked when I switched to Java from C++ was the fact that to understand my colleagues' code all I needed was to read their .java files. In most cases I didn't really need documentation and it was fairly simple to parse the code because Java restricted much of the flexibility that I just mentioned. Trying to do the exact same thing with C++ requires parsing of header files, macros and typedefs. Another issue is that introducing multiple languages can be quite complex and a barrier to productivity, due to limited skill sets within a certain project, even if choosing a language is less risky than before.

I think the concurrency argument is only a temporary one. I'd hate to choose a different language just for that, because it's something that I expect to see native in Java. So far we managed to deal with multi-core and parallel programming quite effectively with Java using event-driven architecture (EDA) and master/worker patterns, and abstract a lot of the concurrent programming with things such as Futures, Remoting, etc. Surely having some of the features of Scala or Erlang as part of Java would have made our life simpler, but if I measure the value vs. risk involved, I'm not sure it justifies using it right now in a real project.

Don't get me wrong. I'm not saying that there is anything wrong with these languages. What I'm arguing is that we need to be very careful before we choose them and make sure that we're measuring the right value, rather than assuming that any of the above arguments applies to our application without proper analysis. Ted was able to convince me to look further into this topic - so I'm probably going to give Scala a try and get a real feel of it.

The event started on Wednesday morning  with a very good presentation by Stephan Janssen. Stephan is the founder of Parleys.com. He is also the founder of the JavaPolis conference held annually in Belgium. He talked about his experience with a wide range of RIA platforms: DHTML, Adobe Flex/Air, JavaFX, Google Web Toolkit (GWT) and Microsoft Silverlight. He discussed his personal experience in using the various technologies as part of Parleys.com.

The combination of a general overview with real-life examples made the discussion quite interesting and lively. The bottom line of this part of the talk was to use Adobe Flex, if you're building a site in the short-term and JavaFX if you're planning on launching your site in about a year's time - due to the maturity cycle and the gaps between the two technologies. Personally I found the fact that there are so many options to do the same thing quite confusing. I wish that we could press the fast-forward button on the maturity cycle of these technologies. Working with previous versions of Parleys.com I must say that i was very impressed with the progress and the *right* use of technologies to build the new version of the site.

Another interesting and quite innovative idea that Stephan presented was about hosting services and collaboration with academic partners. The hosting service will enable companies like ours to host their live presentations in the Parleys.com site. In addition, you can embed the presentation in a blog entry. You can also record your talk online, using a web-based application. The partnership with academic institutions enables scaling not just the content, but also the bandwidth, similar to the way downloads works. IMO Parleys.com could easily become the YouTube of online presentations. If you missed the presentation I'd recommend watching this interview here:

My own presentation, Getting ready for the cloud, seems to have been well-received, although I had some concern that it might be too high-level for some of the audience. You can read some of the comments posted by others here and here.

The presentation included a live demo of a web 2.0 application (displaying market data) running live on Amazon EC2. Although the demo ran over a wireless line, it went surprisingly smooth and I was able to easily redeploy and relocate instances through a simple drag & drop using our UI, which was hosted on one of the EC2 machines. The following day, Uri Cohen gave a session in which he showed the details of what's going on behind the scenes and reviewed the actual code and API used in the demo. If you're interested in experiencing it yourself, you can try out the same demo on our new EC2 version

TSSJS was a good opportunity to meet in person the winners of our OpenSpaces developer competition. I heard interesting stories about what drove them to write their projects. The common theme was the technology challenge - they heard about our technology and scaling pattern and wanted to get a feel for themselves of how it works. You can listen to some of the stories in recent podcasts we published here here and here.

BTW, Jason Carreira, one of the winners, has since worked on another project: a scalable Twitter-like application using GigaSpaces and EC2 (an alpha version already exists, he is now looking for hosting opportunities). And Leonardo Goncalves -- the first prize winner -- is already thinking of the next version of his project. The third winner - Kirill Ishanov -- is also planning to participate in next year's contest. At the end of the first day we showed a video of some of the judges (John Davies and Jullian Browne were missing from the video). It’s a light-hearted video in which the judges also makes fun of Joe Ottinger:)

   
Two of the talks I very much enjoyed were given by John Davies, formerly the founder and CTO of C24 (which was sold to IONA), who has recently started a new venture called Incept5. I've worked with John for many years now and we often have excellent chats about ideas in our respective markets. John's first talk was one I'd heard before, but as always, he updated it with new anecdotes and ideas. He talked about extreme enterprise architectures, specifically ESBs and grid in the low-latency, high-volume, complex envrionments of investment banking. John started by explaining the value of a millisecond to the high-end institutions, literally in terms of dollars, something like $100 million per ms. He went on to talk about compiled languages compared to Java for this sort of processing. It was interesting to see John walk through a very high performance matching and reconciliation engine we had designed together for a client a few years ago and it's exciting to hear that his new company will specialize in this is. John talked about some of the clever coding patterns that had to be implemented to provide linear scalability, and although master-worker was the pattern of choice for scaling, it wasn't as simple as just writing lots of workers.

John's second talk was new to me, and although we discussed the ideas in it in the past, it was fascinating to hear them presented. The room was packed -- standing room only -- as it was a topic near and dear to the hearts of many developers and architects: "The Enterprise without a Database". I thought this would just be an extension of caching, but John went on to emphasize the huge amounts of time and energy (human) being lost on Object-Relational Mapping (ORM). Why do we still persist our well-established Object-Oriented models into a relational database? While ORM is simple at the example level, it doesn't scale given the levels of complexity in today's messaging standards. John made this very clear by example. I got the feeling that he was holding back the solution, perhaps to be released by his new company, but it was clear that there are alternatives to ORM: from caching objects to using CLOBs in classic databases. This is obviously an area to watch, as John always has a good vision for these sort of things.

At the end of the day I had the chance to have a beer with different people in a nice Mexican restaurant in Prague (courtesy of Jodie, the cameraman, and his local friend). After a few beers, mojitos and lots of peanuts (courtesy of John Davis:)), the topic of open source software (OSS) came up. I think that we all agreed that being open is a no-brainer, and that's the way software products should be built. Being open doesn't necessarily mean free - take Jive and Atlassian, for example. They sell commercial products, but they provide customers with the source code

Another model is the dual-license model such as that used by Red Hat and MySQL. It's sometimes referred to as the Fedora model. It means that you have a choice to use a free version but if you do, you're on your own. If you choose to use the supported version, you're going to be charged a subscription, for which you'll get extra features and better packaging/documentation.

I argued that it is important to have a solid business model behind a product/project. It should be as important to the users as to the company developing the product. if a product doesn't have a solid business model two things might happen: the project/product is going to be abandoned at some point due to lack of funding, or the owner of the product will change the licensing model to monetize on the IP and established user-base. We've seen both scenarios happening already.

I also argued that the Fedora model is usually successful only as part of a commoditization strategy. For example, JBoss's strategy was to go after the lower end of WebLogic/WebSphere accounts. The same applies to MySQL. This strategy seems to only work to a certain limit. I argued that this model is not proven in an emerging product category, where large investments in market education and innovation are required to achieve massive and sustainable adoption. In such cases, the Jive/Confluence model seems to be a better fit. Anyway, this topic is worth a separate discussion, so I'll leave it at that for now.

Unfortunately I had to leave on Friday (to be at my daughter's end-of-year party at school), so I missed Shay Banon's presentation. Based on what I heard it went very well.  You can view Shay's presentation here.

Anyway, it was a real fun event and i look forward to next year.

June 24, 2008

InfoQ Article - RAM is the new disk...

Steven Robbins published an interesting article on InfoQ titled  RAM is the new disk.
In the comment thread, Steven Robbins quoted Tim Bray and others comparing file system performance to memory:

Memory is the new disk! With disk speeds growing very slowly and memory chip capacities growing exponentially, in-memory software architectures offer the prospect of orders-of-magnitude improvements in the performance of all kinds of data-intensive applications. Small (1U, 2U) rack-mounted servers with a terabyte or more or memory will be available soon, and will change how we think about the balance between memory and disk in server architectures....

It raises the following questions:

What if the disk were RAM-based? Does that mean that all we need to do is replace the current disks with RAM technology to gain speed? The title of the article leads people to think along those lines.

My own take:

It's not just the speed of memory compared to disks that makes a difference. It's not even the extra benefit of the collocation of CPU and memory. What's really a important is the fact that disk is a sequential storage medium that was designed primarily to store a stream of bytes, not tables of data. That means that if you want to store data objects you need to serialize them into bytes, map sectors in the file system that points to the location of those bytes. Maintaining an index to this data is a relatively expensive operation as every additional index is stored as a copy of the original data, there is no real option to access data by reference, etc. If you think about it, existing RDBMS are basically a mapping layer between data-tables representations and sequential storage. A large part of existing database implementations is spent on addressing the impedance mismatch between the two representations models. All this complexity doesn't really exist when we're dealing with memory. That means that if will take existing databases and run them on memory based devices we're basically going to force the limitations of sequential storage representations into memory.

To exploit the real value of memory based resources we need to have different approach and implementations that assume that data can be accessed by reference - that objects can be accessed directly from our application without complex mapping layer in our native application domain.

At this point I'd like to end with Tim's last remark:

Disk will become the new tape, and will be used in the same way, as a sequential storage medium (streaming from disk is reasonably fast) rather than as a random-access medium (very slow). Tons of opportunities there to develop new products that can offer 10x-100x performance improvements over the existing ones.

June 16, 2008

Meet us @ TSSJS Prague

TheServerSide has a good record of picking nice spots for their conferences, and this year's Java Symposium in Prague is no exception.

It's looking to be a fun event, as I'm going to meet not just lots of old friends, but also the winners of our first OpenSpaces developer contest. I've already written about the contest and some of the submissions in a previous post. If you haven't already, check it out ,as we decided to continue with the contest next year and use  TSSJS as an opportunity for attendees to apply for "early bird scholarships" worth $1,000 each -- all this at an award ceremony we're holding in Prague during the show. Besides free booze and food at this event, we're going to show a nice video featuring the judges of the competition. Those who worked with Alit through the current competition will probably be happy to know that she is going lead this part of the show together with John Davis who was one of the Judges.

I'm going to be there with some of my colleagues from GigaSpaces, namely Shay Banon and Uri Cohen.
My presentation is titled Getting ready for the cloud but it really talks about the next wave in distributed computing in which clouds plays an important role and have the potential of changing many of the things we used to do in the past.

Banon will be talking about some of the work that he's been doing with Mule, Lucerne/Compass and Spring in his session Beyond Data Grids. I've seen him discussing some of these topics in Las Vegas this year, so I know it's going to be really interesting. Last time it sparked many questions about how clustering technologies can deal with scaling challenges, how in-memory data grids can replace or co-exist with traditional databases, and how they can be applied to different frameworks given real life examples.

Uri is going to talk about his experience in building scalable web 2.0 applications using Ajax, Tomcat and Spring MVC, and running on the Amazon EC2 cloud. He will discuss specific patterns for dealing with Ajax scaling issues, and also provide patterns and tips for moving from a tier-based to a scale-out model based on recent work he's done with JBoss and, of course, GigaSpaces.

The TSS event is also going to be a good opportunity for us to expose some of our latest development in our upcoming 6.5 release,such as the new Service Virtualization Framework (based on Spring Remoting), Dynamic language support, extended support for hibernate and enhanced database integration, built-in Maven support, support for Spring 2.5 annotations and enhanced administrative and real-time monitoring.

Mule users will also benefit from our extensive support for the Mule ESB. We're also going to show some of the latest developments with EC2 and cloud computing environments. Even though TSS events tends to be Java-centric, I believe that Java users will be happy to learn about our interoperability among Java, C++ and .Net. For those unfamiliar with it, I would recommend giving it a closer look as it provides high performance and an extremely simple alternative for making the language barrier pretty much obsolete.

There is much more to it than I can cover in this post. In fact, we realized that an entire post will not be enough to cover all the relevant content of our 6.5 release, so expect to see several dedicated posts in the coming weeks -- here and on the GigaSpaces Blog -- covering different aspects of new features, including some "behind-the-scenes" stories. Stay tuned!

June 05, 2008

Economies of Non-Scale

Scalability forces us to think differently. What worked on a small scale doesn't always work on a large scale  -- and costs are no different.

To measure the cost impact of scaling, let's look at the amount of resources required to scale to a given level. We'll use Amdahl's Law as a method to measure the amount of required CPUs. This will provide us a proxy for hardware and software costs. Later on we'll also review other costs related to the process of scaling. What's going to come up clearly in this analysis is that the cost of scalability grows exponentially in non-linearly scalable applications.

I'll also argue that scaling is not just a technical issue: it has a direct impact on our business and its ability to effectively compete.

Measuring the Impact of Scalability on Hardware and Software Costs

The following analysis is based on Amdahl's Law, which is described well in this Wikipedia entry. Put briefly, the negative impact of the non-scalable portion of our application grows exponentially relative to the scaling requirement, up to the point in which adding resources will not improve performance/throughput, as seen in this diagram:

 

Amdhal_law_2

By non-scalable I mean the part that is serial (non-parallelizable) -- or to use  terminology that is more relevant: the level of contention in the application. Contention can be thought of as the percentage of time operations wait on things such as shared table-locks in the database, persistent queues or distributed transactions.

The diagram above shows that if 90% of our application is free of contention, and only 10% is spent on a shared resources, we will need to grow our compute resources by a factor of 100 to scale by a factor of 10! Another important thing to note is that 10x, in this case, is the limit of our ability to scale, even if more resources are added.

Now let's see what it will cost us -- in terms of CPUs and software licenses -- to increase our scalability by a factor of 10, assuming that we have only 10% contention (and that's a fairly optimistic scenario with prevalent tier-based architectures).

 

Costofscaling_4

Take-Aways

There are two key take-aways from this analysis:   

  1. The cost of non-linearly scalable applications grows exponentially with the demand for more scale.
  2. Non-linearly scalable applications have an absolute limit of scalability. According to Amdhal's Law, with 10% contention, the maximum scaling limit is 10. With 40% contention, our maximum scaling limit is 2.5 - no matter how many hardware resources we will throw at the problem

An Example

The following is inspired by numerous true stories I have seen at GigaSpaces customers.

A team is tasked with building an online order management application that needs to process 1,000 orders per second. They choose a typical n-tier architecture with a web server as the front-end and a database as the data-tier. Note: In this case I give the web-tier as the front-end. In some systems the front-end is actually a messaging system. While the implementation is rather different, the behavior for the purpose of this discussion is the same, as both represent different forms of feeds or service requests.

Let's assume that the team designed the architecture by the book to ensure 100% reliability and consistency. This means that every critical transaction is stored in the database.

They now found that a single web server can handle only 200 transaction/sec, so they decide to put a load-balancer on the front-end, and deploy 5 web servers to meet the 1000 orders/sec requirement. At this point they realize that despite the fact that they increased the amount of servers by 5, the total number of orders/sec didn't really increase by much -- and only reached about 400 orders/second.

They start to monitor the system, and find that the application spent 40% of its time on shared locks in the database. As we already saw above, with 40% contention, we can only increase the throughput by a factor of 2.5 -- or 500 orders/second, so no matter how many web servers are added, the application will never be able to meet its throughput goal.

So the team decides to reduce the contention by placing a distributed cache in front of the database, which will reduce the hits on the database. They are cost-conscious so they select a free product -- memcached -- as the caching solution. As memcached cannot serve as the system of record (it doesn't support transactions, queries, is not highly-available, etc.), adding it reduces the contention,  but does not eliminate it.

For this example, let's take an extremely optimistic scenario and assume that memcached reduces the contention from 40% to 10%. According to Amdhal's Law, increasing throughput by a factor of 5 with 10% contention, will require 10 servers. Now the team is happy! 

Just as they get the system to work as expected, the boss knocks on the door and says: "Hey,  we're going to launch a major campaign for the holiday season, and marketing anticipates double the number of visitors on our site. If we're very successful, we might even triple the normal traffic. Can we support this?".

The team is in initial shock. "Double? Triple? Are you kidding me!? We just worked our ass off to get this application to work for the existing load, and now he's talking about doubling and tripling the load as if it were trivial?" The team now faces the prospect of explaining to management that they don't know how to achieve this capacity, or even worse, that they are incapable of it (while their competitor has already achieved it last year). Let's see: double throughput means 2,000 orders/sec. So they will need to increase single server throughput by a factor of 10.  According to Amdhal's Law they will need grow from 10 servers to 100 servers! Tripling the capacity is not even an option. The system already reached its maximum limit when it grew by a factor of 10.

And there's one more thing the team needs to tell the boss...

Ahem, regarding the budget...

For the sake of discussion, let's assume that the team was using free products to reduce software license costs. To meet their initial goal of 1,000 orders/second they had to use 10 machines for web servers and another one for the database. If each machine costs $10,000, total costs are $100k for the web-tier and another $10k for the data-tier, so we're talking $110k total cost of hardware.

In addition, the team spent a substantial amount of time on iterations to analyze the bottleneck and find a solution, wire the pieces together, make sure everything works and modify code. Let's say this development effort took 5 team members 3 months to complete, so it cost about $150k (again, I'm being optimistic). Total costs are at $260K.

As we've already seen, to double capacity, they will need 100 servers. Let's also assume that it will require a similar effort to develop, tune and test it  -- so $900k in additional hardware and another $150K for development. Total costs are now at $1.31 million.

Now off to triple the load... wait, we can't. The team now literally has to go back to the drawing board (a whiteboard in this case) and completely re-design and re-write the application -- if they can even figure out how to do it. Can we even measure the cost of such a scenario? We can probably measure the development cost, but what is much harder to measure is the amount of money the company is going to lose because it can't support the tripling of load.

It would be a false assumption that the company controls the situation, and has time to plan everything in advance. What if Mylie Cyrus shows up on the cover of Seventeen holding the super-cool gizmo the company sells and now all of a sudden there's a mad rush to buy it on the site? We need to assume that we can't predict the load. It doesn't happen when we plan for it. It can happen when we least expect it.

With social networks or electronic trading and e-commerce, such events can quickly lead to a viral effect or what's also known as an 'event storm'. Such a chain of events can quickly lead to disaster, such as a site crash, and the company is now all over the news and the blogosphere (and not in a good way), customers are frustrated and many defect to the competition. The trouble is that they can't fix it in a snap. Remember: they need a few months heads up to re-write the application.

Lessons Learned

There are a few things we can learn from this story:

  • direct costs of scaling grow exponentially with the demand for scale;
  • the cost of software licenses can be a very small factor in total cost;
  • indirect costs, resulting from the unpredictability of our system and the inflexibility that it imposes on our business, can have huge implications to the business, well beyond any measurable criteria. Even if your company can measure the direct losses from downtime due to lost sales or trades, you'll be hard-pressed to measure damaged reputation, loss of customer trust, and in some cases, the loss of your job!

A Better Approach: High-Throughput and Linear Scalability

Things don't have to be the way I described above -- and for a growing number of companies -- they aren't. In the blog post Twitter as a scalability case study, I detailed the principles that companies such as Google and Amazon follow, as well as those who use GigaSpaces, so I won't repeat them here.

Following those principles, we use a memory-centric solution that addresses architecture holistically (end-to-end) and is linearly scalable. Because it is in memory, it performs much better. GigaSpaces customers have seen throughput improvements of between 10x and 100x, depending on the existing architecture and the scenario. Let's say -- again, being very conservative -- that throughput with this approach is only 5x: 1000 orders/sec per server. This would mean that to meet the goal of 1,000 orders/sec we only need 1 machine (compared to 5). Even if the software cost of this solution is $20k/CPU, thanks to the reduction in hardware costs, we will end up with significant cost savings.

Perhaps more importantly, this approach follows a shared-nothing architecture (meaning each node is entirely self-sufficient) and is linearly scalable. As such, scaling simply requires adding more servers as needed, without code changes or complex provisioning. Moreover, if we know the throughput of a single server, we know exactly how many servers we will need to achieve future or unanticipated requirements. All we need to do is multiply the number of servers by the throughput per server. Remember, because it is linearly scalable, it does not suffer from diminishing returns, and there is no 'scalability wall'. So if one machine can process 1k orders/sec, two machines can process 2k orders/sec and N machines can process Nk orders/sec.

In the case of our team above, if they had taken this approach, they would have only needed two servers to double the throughput (with a total solution cost of $220k - compared to $1.31 million! ) and three servers to triple it (with a total solution cost of $250K - compared to who knows what). And if they had faced any unanticipated peaks (thanks to Mylie), they could have quickly scaled for that as well.

What's interesting about linear scalability, is that even if we assume no throughput increase at all PER SERVER, the cost savings are still remarkable. In our case, let's assume that the linearly scalable solution still has a throughout of 200 orders/sec per single server.   In order to achieve the 2,000 orders/sec throughput, we will just need 10 servers, compared to 100.

The example above illustrates that what initially appeared to be a low-cost solution (free middleware), became extremely expensive as scalability requirements grew.

Measuring the Cost of Scalability: A Cheat Sheet

Following is a short summary of what we should consider when measuring the cost of system scalability (or lack thereof):

  • Cost of hardware and software, as a function of:
  • How many CPUs (or machines) are required to achieve the desired throughput/concurrent users/latency, given the:
    • Throughput per single server
    • Level of contention, and therefore, the required number of servers needed to scale as prescribed by Amdahl's Law. This calculation needs to be performed given different scale requirements (2x, 3x, 5x, 10x, etc.), as it will grow exponentially
    • If we cannot achieve on-demand scalability -- we also need to consider the cost of hardware and software required for over-provisioning to ensure we can handle peak loads
  • Cost of development, QA and testing
    • Initial design and development cost, including learning curve and integration
    • Cost of re-designs and re-writes for when we need to scale our application
  • Cost of failure – what is the cost of downtime due to under-provisioning or inability to scale on-demand. This should consider direct revenues loss, compensation payments, loss of productivity, damaged reputation and future income and so on.
  • Provisioning, deployment and operations costs - including management and monitoring. The more complex the system is, the more difficult it is to identify bottlenecks and root causes.

A Final Note on Comparing Solutions:

In the context of scalability and ROI, when we evaluate competing solutions, we need to make sure that we are comparing apples to apples.

  1. Comparing Apples-to-Apples: It is not enough to measure the license cost. We need to normalize it with the performance and scale factors. For example, if two products cost the same in terms of cost/CPU, but one performs 5 times better, then the cost of that product is, in fact, one fifth of the other
    .
  2. Total Cost of Ownership (TCO): We cannot look at the software license costs alone. We need to assess the overall cost of the system, including hardware (and other platform costs, such as OS), additional components required (such as load balancers and storage), cost of development and cost of failure. In the final analysis, free products that are not linearly scalable are going to cost much more than commercial products that are.
       
  3. End-to-End Measurement: When it comes to scalability, you are only as strong as your weakest link, so assessing the cost of scaling requires a holistic measurement. Before we compare two products we need to understand how they each play a role in achieving end-to-end scalability. Linear scalability requires an end-to-end solution. Solutions that are built from a bundle of tiers and products are likely to be non-linearly scalable, as contention is created by the integration of the tiers and products, the need to ensure consistency, different clustering models and other issues. This means that before we can even measure or compare cost, we first need to compare what it takes to reach linear scalability with each product. It might be that on a simple caching or messaging level, two products provide the same level of scalability. When, however, we need to integrate the messaging system, use a transaction manager, a database or filesystem to ensure reliability, our  end-to-end scalability is going to be significantly limited.

This post was co-authored with Geva Perry.

My Photo

Twitter Updates

    follow me on Twitter