Open Source

October 29, 2008

Need scalability? Don't forget pricing

Most discussions about scalability treat the topic as a purely technical or architectural challenge and ignore cost. The problem is that when we truly scale our application and want to benefit from economies of scale, we end up hitting scale limitations, not because of technical issues, but because of the pricing and licensing models.

Scalable pricing

Scalable pricing means a pricing scheme that provides the benefits of economies of scale. Below are pricing models commonly used for software products and how they fit in the new dynamically-scalable world.

  • Free - while this certainly sounds like the best option (and may very well be) the customer needs to be aware of the following:
- The free license of a software product typically does not include support: not an option for most mission critical applications.
- When you do pay extra for support, you will typically be charged just like any other run-time license on a per CPU basis.
- Make sure that the company behind the product has a sustainable business model, otherwise there is a good chance that it will either die when its funding dries up or change its license model to monetize its user base. That's fine, but all it means is that it's not really a free offering in the long run, and you don't know what the pricing model will be exactly.
- In terms of total cost of ownership (TCO), free products are not necessarily the cheapest option. TCO is dependent on many factors, for example, dependency on other products (and their license costs), the need for integration and maintenance, etc. See my post, Economies of non scale, for more on the topic.
  • Subscription model - With a subscription model you pay a fixed periodic fee, typically annually for infrastructure software and monthly for SaaS. Subscription pricing suits on-demand scalability because it lets you grow or reduce cost based on the annual use of the product.
  • Pay per use - This model is even more flexible than the subscription model because it gives you finer granularity. Pay per use comes in various forms, where usage can be measured by CPU utilization or bandwidth utilization. Amazon, for example, charges per machine utilization for its EC2 service and per data utilization for its data services.
  • Perpetual license - With this model you buy licenses in advance and pay for support separately (normally 15-20% on top of the per-CPU license). This is the most commonly used model for commercial software products; however, because of the large initial investment it requires, it doesn't fit well with on-demand environments.
  • Enterprise unlimited license - With this model you pay a premium price in advance (based on potential future usage) and get the freedom to use the software without any limit. This model fits environments where you anticipate that usage of the product will become widespread over a fairly short period of time, so that pay per use or any of the other models mentioned above would become more expensive.
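To make the trade-off between these models concrete, here is a minimal sketch that compares the average yearly cost of three of them. Every price in it (the hourly rate, the subscription fee, the license price) is a hypothetical placeholder chosen for illustration, not a real vendor figure; only the 20% support rate echoes the 15-20% range mentioned above.

```python
# All prices below are hypothetical placeholders, not real vendor figures.
HOURS_PER_YEAR = 24 * 365


def pay_per_use_cost(cpus, utilization, rate_per_cpu_hour):
    """Yearly cost when billed only for the CPU hours actually consumed."""
    return cpus * utilization * HOURS_PER_YEAR * rate_per_cpu_hour


def subscription_cost(cpus, fee_per_cpu_year):
    """Yearly cost of a flat per-CPU subscription, regardless of utilization."""
    return cpus * fee_per_cpu_year


def perpetual_cost(cpus, license_per_cpu, support_rate, years):
    """Average yearly cost of an up-front license plus yearly support."""
    upfront = cpus * license_per_cpu
    support = upfront * support_rate * years
    return (upfront + support) / years


def cheapest_model(cpus, utilization, years):
    """Return the name of the cheapest model and the full cost table."""
    costs = {
        "pay-per-use": pay_per_use_cost(cpus, utilization, rate_per_cpu_hour=0.50),
        "subscription": subscription_cost(cpus, fee_per_cpu_year=2000),
        "perpetual": perpetual_cost(cpus, license_per_cpu=5000,
                                    support_rate=0.20, years=years),
    }
    return min(costs, key=costs.get), costs
```

With these assumed numbers, a lightly used deployment favors pay per use, a busy one favors subscription, and a busy deployment kept for many years favors the perpetual license. The point is the shape of the comparison, not the specific break-even values.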

Which model to choose?

Each of the models has pros and cons and therefore the answer depends on your situation. Also, over time, as the situation changes, you will probably realize you need a different license model, and so it becomes equally important that the product you choose will give you the freedom to move from one model to another in the future.

GigaSpaces scalable pricing

With GigaSpaces we continuously look into ways to make our software license cost fit the on-demand world. For example, we launched a free Start-Up program that provides a totally FREE version of GigaSpaces for startups (hundreds of start-ups have already signed up for this program since we launched it last year). We also provide a Pay-Per-Use model for those running on Amazon EC2.

We felt that even though this is fairly flexible pricing, we could do better. As of our 6.6 release, we added the option to buy our software at a yearly subscription price, and we also launched a new package called XAP Standard Edition, which is sold at a very low price of $9,500 per package (not per CPU), where the package includes two servers, four GigaSpaces nodes and up to 50 clients or remote servers.

These changes were designed to address the needs of developers looking to start running their applications at a relatively low scale, who need the full functionality of the product, but cannot afford the full XAP price. Another principle that we kept when we designed this package is that moving from Standard to Premium edition wouldn't require any change in your architecture or code - which means that you could always scale to the premium edition just by changing the license key.
More details about the new pricing model are available here.

Other references:
GigaSpaces and the Economics of Cloud Computing

Economies of Non-Scale

October 04, 2008

Is MapReduce going mainstream?

I'm getting a lot of questions lately about the use of MapReduce: how it compares with other technologies such as Grid, and how the different solutions that claim support for MapReduce (GigaSpaces included) fit into the puzzle. A good starting point is the intense discussion on the cloud computing mailing list under the topic "Is Map/Reduce going mainstream?", where I contributed some of my own thoughts.

To summarize the questions on this topic, I'd state it as follows:

  1. How can we reduce the barrier to entry for implementing MapReduce specifically, and parallel processing applications in general?

  2. Many of the use cases for MapReduce represent some sort of data analytic application. But can MapReduce be used as a generic parallel processing mechanism? Specifically, is it suited to deal with issues such as data affinity, asynchronous batch processing, etc.?

In this post I'll try to answer these questions, but first, a few clarifications:

What is MapReduce?

Quoting from the Wikipedia definition:

MapReduce is a software framework introduced by Google to support parallel computations over large (multiple petabyte[1]) data sets on clusters of computers.

Why do we need a new model for processing large data sets?

Unlike central data-sources, such as a database, you can't assume that all the data resides in one central place, and therefore, you can't just execute a query and expect to get the result as a synchronous operation. Instead, you need to execute the query on each data-source, gather the results and perform a 2nd-level aggregation of all the results. To speed the time it takes to run this entire process, the query needs to be done in parallel on each data source. The process of mapping a request from the originator to the data source is called "Map"; and the process of aggregating the results into a consolidated result is called "Reduce".
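The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PARTITIONS` lists stand in for remote data sources, threads stand in for parallel execution on each source, and the names `map_query`, `merge` and `map_reduce` are made up for this example rather than taken from any product.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical partitions: each list stands in for one remote data source.
PARTITIONS = [
    [{"symbol": "A", "qty": 10}, {"symbol": "B", "qty": 5}],
    [{"symbol": "A", "qty": 7}],
    [{"symbol": "B", "qty": 3}, {"symbol": "A", "qty": 1}],
]


def map_query(partition):
    """'Map' step: run the query locally on one data source."""
    totals = {}
    for row in partition:
        totals[row["symbol"]] = totals.get(row["symbol"], 0) + row["qty"]
    return totals


def merge(left, right):
    """'Reduce' step: fold two partial results into one."""
    merged = dict(left)
    for symbol, qty in right.items():
        merged[symbol] = merged.get(symbol, 0) + qty
    return merged


def map_reduce(partitions):
    """Query every partition in parallel, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(map_query, partitions))
    return reduce(merge, partials, {})
```

Calling `map_reduce(PARTITIONS)` returns the per-symbol totals across all three partitions. Note that only the small partial results travel back to the caller, not the underlying rows.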

MapReduce implementations

Hadoop is the most well-known MapReduce implementation.

Hadoop is an open source project that implements the exact spec defined by Google in Java. As such, it was designed primarily to enable MapReduce operations on distributed file systems and was not really designed as a general purpose parallel processing mechanism.

The Wikipedia entry on MapReduce (http://en.wikipedia.org/wiki/MapReduce) has references to other implementations in other languages, including Greenplum, Skynet, and Disco.

Other forms of MapReduce implementations

Over time, the term MapReduce has expanded in definition to describe a more general purpose pattern for executing parallel aggregation of distributed data-sources, rather than referring to a specific type of implementation. GigaSpaces, GridGain, and to a degree, Terracotta, all took a different approach than Hadoop in their MapReduce implementations. Rather than implementing the exact Google spec in Java, these three aimed to take advantage of the Java language and make the implementation of the MapReduce pattern simpler to the average programmer (I'll get back to that later).


How does MapReduce differ from other grid implementations?

Compute Grid

While MapReduce represents one form of parallel processing for aggregating data from distributed data sets, it is not the only one. "Compute Grid" is a term used to define another form of parallel processing, used mostly for compute-intensive batch processing. A typical batch process takes a long-running job, breaks it into small tasks and enables the execution of those tasks in parallel to reduce the time it takes to execute the job (compared with the time it would have taken to execute the tasks sequentially). This model is a good fit for executing relatively compute-intensive and stateless jobs. A typical scenario for this would be a Monte Carlo simulation, such as the one used to perform risk analysis reports in the financial industry. This type of analysis is more compute-intensive than data-intensive. Most compute-grid implementations have the following components:

  1. Scheduler

  2. Job executor

  3. Compute agent

The executor submits jobs. The scheduler is responsible for taking the job, splitting it into a set of small tasks (this process requires specific application code) that are sent in parallel based on a certain policy to a set of compute nodes. The agents on each compute node execute those tasks. The results of those tasks are aggregated back to the scheduler.

The scheduler is responsible for monitoring and ensuring the execution of the tasks. The scheduler was designed to support advanced execution policies, such as priority-based execution as well as advanced life-cycle management.
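As a rough sketch of these three roles, the code below splits a Monte Carlo job into tasks, runs them in parallel and aggregates the results. It is an assumption-laden toy: threads stand in for compute nodes, a pi estimate stands in for a real risk simulation, and the function names `split_job`, `run_task` and `schedule` are hypothetical. It also leaves out the monitoring, priorities and life-cycle management that a real scheduler provides.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_job(total_samples, task_count):
    """Application-specific split: divide the samples across tasks."""
    base, extra = divmod(total_samples, task_count)
    return [(task_id, base + (1 if task_id < extra else 0))
            for task_id in range(task_count)]


def run_task(task):
    """Compute agent: a stateless task counting random points inside a quarter circle."""
    task_id, samples = task
    rng = random.Random(task_id)  # per-task seed keeps the sketch repeatable
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits, samples


def schedule(total_samples, task_count, agents):
    """Scheduler: split the job, dispatch tasks in parallel, aggregate results."""
    tasks = split_job(total_samples, task_count)
    with ThreadPoolExecutor(max_workers=agents) as pool:
        results = list(pool.map(run_task, tasks))
    hits = sum(h for h, _ in results)
    samples = sum(s for _, s in results)
    return 4.0 * hits / samples
```

The executor's role here is simply the caller of `schedule(200_000, task_count=8, agents=4)`, which returns an estimate of pi. Each task needs only its own arguments, which is what makes this kind of job easy to spread across machines.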

Master/Worker Pattern (simple Compute Grid)

The Master/Worker pattern is a simplified version of parallel batch execution, based on the Tuple Space model. Tuple Spaces emerged from the Linda project at Yale University. JavaSpaces is the main Java implementation of the model. A good description of this model is provided in this article. In a master/worker pattern, tasks are assumed to be evenly distributed across worker machines. In this case there is no need for an intermediate scheduler. Load-balancing is achieved through a polling mechanism. Each worker polls for tasks and executes them when it's ready. If a worker is busy, it simply won't take new tasks, and if it is free it will poll the pending tasks and process them. Consequently, workers running on more powerful machines will process more tasks over time. In this way, load balancing is implicit, supporting simple task distribution models. For this reason, master/worker implementations tend to be more useful for simple compute-grid applications. The fact that there is no need for an explicit scheduler makes master/worker more performant and better suited for cases where latency is an important factor.
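The polling behavior can be illustrated with a small sketch in which a thread-safe queue plays the role of the shared space. This is an assumption for illustration only: it is not the JavaSpaces API, and the names `worker` and `run_master` are hypothetical.

```python
import queue
import threading


def worker(name, tasks, results):
    """Poll the shared task queue whenever this worker is free."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return  # all tasks were queued up front, so empty means done
        results.put((name, task, task * task))


def run_master(values, worker_count):
    """Master: write tasks to the shared queue, then collect the results."""
    tasks, results = queue.Queue(), queue.Queue()
    for value in values:
        tasks.put(value)
    threads = [threading.Thread(target=worker, args=(f"w{i}", tasks, results))
               for i in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results.get() for _ in range(results.qsize())]
```

There is no scheduler deciding who gets which task. A worker that finishes sooner simply comes back to the queue sooner, so faster workers end up processing more tasks.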

MapReduce & Compute Grid: Summary

Although both MapReduce and Compute Grids provide a parallel processing model for solving large-scale problems, they are each optimized for a different kind of problem. MapReduce was designed to shorten the time it takes to process complex data-analytics scenarios. The results of the processing need to be returned in real time, as the originator of the task normally blocks until its completion. Compute Grid applications are aimed at reducing the time it takes to process complex computational tasks. The job is executed as a background process that can often run for a few hours. Users don't typically wait for the results of these tasks, but are either notified or poll for the results. With MapReduce, the application tends to be data-intensive, so scalability is driven mostly by the ability to scale the data through partitioning. Executing the tasks close to the data becomes critical in this scenario. Compute Grid applications tend to be stateless, and normally operate on relatively small data sets (compared with those of MapReduce). Consequently, data affinity is considered an optimization rather than a necessity.

When to use MapReduce, Compute Grids and Master/Worker?

  • If you need to aggregate data that resides in a distributed file system, I would recommend the use of Hadoop and the like.

  • If you need to aggregate data that resides in other data sources, such as an in-memory data-grid (IMDG), you should consider GigaSpaces, or a combination of compute grid and data grid products.

  • If your application is compute-intensive and relatively stateless in nature, you should consider the classic Compute Grid implementations.

  • If you're looking for a real-time (or near-real-time) and lightweight compute-intensive application, you should consider Master/Worker implementations.

In reality, most compute-intensive applications are not purely stateless. To execute, the compute tasks need to process data that comes from either a database or a file system. In small-scale applications, it is common practice to distribute the data with the job itself. In large-scale compute-grid applications, however, passing the data with the job can be quite inefficient. In such cases, it is recommended to use a combination of Compute and Data Grid. In this case, the data is stored in a shared data-grid cluster and passed by reference to the compute task. So we see the need for a combination of Compute and Data Grids becoming more common.
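The difference between shipping data with the job and passing it by reference can be shown with a small sketch. The `DATA_GRID` dictionary is a hypothetical in-process stand-in for a shared data-grid cluster, and `wire_size` uses pickling only as a rough proxy for what would travel over the network.

```python
import pickle

# Hypothetical in-process stand-in for a shared data-grid cluster.
DATA_GRID = {"portfolio:1": [float(i) for i in range(10_000)]}


def wire_size(task_args):
    """Bytes needed to ship a task's arguments to a compute node."""
    return len(pickle.dumps(task_args))


def value_by_payload(positions):
    """Small-scale style: the data travels with the task."""
    return sum(positions)


def value_by_reference(key, grid=DATA_GRID):
    """Large-scale style: the task carries a key and reads from the shared grid."""
    return sum(grid[key])
```

Both functions compute the same result, but `wire_size(("portfolio:1",))` is a few dozen bytes while `wire_size((DATA_GRID["portfolio:1"],))` is tens of kilobytes. In a real deployment the by-reference task still has to read the data, which is why running it close to the grid partition matters.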

Too many options? Feeling confused?

At this point you may be scratching your head wondering whether or not your application falls precisely in any of the above categories.

A quick reality check will reveal that many existing applications consist of a variety of the above scenarios, mixed with traditional client-server models.

In such cases, attempting to use a different product for each scenario in our application is going to make things extremely complex.

How do we make distributed programming like MapReduce simple?

This question has been the driving force for many of our recent development efforts.

To simplify things, we realized that we need to:

  1. Grid enable existing programming models -- Use abstraction and virtualization techniques to introduce parallel processing as part of a normal client/server programming model.

  2. Reduce the number of frameworks -- Provide a common model for using both parallel computing models: batch (compute-intensive) and real-time aggregation (data-intensive).

  3. Make data-awareness implicit with all APIs -- In reality, most applications are stateful to some degree, so we need to make data-awareness implicit within our API and not an afterthought. External integration solutions tend to lead to complexity.

Where does GigaSpaces fit in?

GigaSpaces emerged from the tuple space model, specifically JavaSpaces, and was one of the first implementations of the Master/Worker pattern. At a later stage, we extended our JavaSpaces implementation to a full IMDG (In-Memory Data-Grid). In large-scale compute grid applications, the GigaSpaces Data-Grid is often used in conjunction with other Compute-Grid implementations, either commercial or open source. This puts GigaSpaces in a unique place, providing data-grid and data-aware compute grid capabilities using the same architecture. We also provide built-in integration of our Data-Grid with more advanced Enterprise Compute Grid products, such as those from DataSynapse and Platform Computing.

As of version 6.0, we offered abstraction layers (referred to as the Service Virtualization Framework or SVF) that take advantage of our existing space-based implementation in a way that doesn't require a complete re-write or a steep learning curve for developers who have already written their business logic as SessionBeans, Spring Remoting, RMI, CORBA, SOAP and other common Client/Server programming models. Our aim was to make distributed programming simple to the average programmer. We achieved this goal by following the same principles that I laid out above. For example, we introduced a set of abstractions on top of our space-based implementation. As we support both data distribution and task distribution, we are able to reduce the number of required frameworks and runtime components, as well as avoid the need for external services to ensure data affinity. In addition, we extended our support for aggregated MapReduce queries using a new Executor framework. With this we can support MapReduce and batch processing using the Master/Worker pattern and the *same* consolidated programming model.

The idea behind all this is to make scale-out development simple by making the API as close as possible to prevailing programming models, and by reducing the number of products and components required to scale either data-intensive or compute-intensive applications.

Final notes

The emergence of MapReduce specifically, and Grid computing in general, creates a need for another type of programming model currently missing in most existing mainstream frameworks and products. So far the solution has been to provide different specialized frameworks to address each need. The fact that we have so many different frameworks (MapReduce included) makes things more complex.

On the Cloud Computing mailing list, Chris K Wensel wrote the following comment:

'thinking' in MapReduce sucks. If you've ever read "How to write parallel programs" by Nicholas Carriero and David Gelernter (http://www.lindaspaces.com/book/), many of their thought experiments and examples are based on a house building analogy. That is, how would you build a house in X model or Y model. These examples work because the models they present are straightforward.......If companies like Greenplum are using MapReduce as an underlying compute model, they must offer up a higher level abstraction that users and developers can reason in.

Indeed, making MapReduce part of mainstream development requires a higher-level abstraction. That abstraction needs to provide the means to use existing programming models on top of MapReduce, to shorten the learning curve and ease the transition from existing applications to distributed scale-out applications. Having said that, this is not enough, as we're still going to end up with multiple frameworks for addressing the various parallel programming models that are not covered by MapReduce, such as Compute Grids and batch processing. It is therefore critical to map those different models into a coherent and consistent model that supports all the various programming semantics, including MapReduce, Master/Worker and batch processing, in addition to the classic Client/Server model, with the ability to transition smoothly among them, without the need to switch or integrate different frameworks for each, and without the need to write our business logic in a completely different way for each.

August 18, 2008

GigaSpaces XAP 6.5/6.6 new releases

GigaSpaces 6.5 was released at the end of June, and we are now working on the 6.6 release, with the first milestone already publicly available. These are major milestones in a series of upcoming releases all aimed at strengthening our proposition as a Scale-Out App Server. Our main goal is to significantly simplify the process of achieving scalability, including scaling an EXISTING application within days and without enforcing a complete re-architecture.

I refer to this as our "Seamless Scaling" or "Simple Scaling" initiative. You can read some of the rationale behind this initiative in my previous post, Can scalability be made seamless. This is a very ambitious goal, and we are by no means finished. In addition to the tremendous enhancements already put in place, we have a long-term roadmap that covers many aspects of the product.

The efforts we have undertaken (as well as those on our roadmap) involve enhancements to our development frameworks in Java, .Net and C++, mostly around the abstraction layer, including supporting standard APIs that enable us to inject many of our Data Grid and event-driven capabilities through annotations and configuration, with zero or minimal changes to application code.

They also involve increasing robustness and making large-scale deployments simpler to deploy and manage. Other efforts include extensive integration with popular frameworks, such as Spring and Mule, and recently we also added Web framework integration with Jetty. All this is designed to make the end-to-end scaling experience extremely simple and native. Judging by recent feedback we received, some of it publicly referenceable, it looks like we're making great strides in achieving our goal. I particularly liked the following quote from one of our existing customers, Monte Paschi Group, who built a new pricing system with GigaSpaces. Their full case study is available here. I chose the following quote from this study as it highlights one of the benefits that we don't always emphasize enough, development simplicity:

The development team is happy, too, since the architecture has been
greatly simplified compared to the multi-layered application server system.
"We're not a software company, we're a financial company," Santini
explains. "We didn't have weeks or months to study the technology. Our
main goal was to use it to achieve our goals. GigaSpaces XAP allowed us
to do that right out of the box."

You can read the full details of the 6.5 release here. For convenience, we grouped the long list of features into separate Java, .Net and C++ categories, and provided detailed descriptions that outline the rationale and the value behind each feature.

In this post I'll try to highlight some of the important features and provide insight into our future plan.

Seamless scaling using the Service Virtualization Framework

At the application layer the most notable feature is the Service Virtualization Framework. The Service Virtualization Framework (SVF) can be seen as a major enhancement to Session Beans in EJB3 and Spring Remoting. This framework enables you to write your business logic as a POJO and deploy the services across a cluster of machines, while providing a single client proxy that virtualizes all those instances as if they were a single server. For more details I recommend reading a new white paper covering the concept behind this framework and how it can help you build scalable, high-performance SOA and event-driven applications in a simple way. The white paper is available here.

Seamless scaling of popular development frameworks

We enhanced and expanded our integration with popular development frameworks. The purpose of the integration is to provide end-to-end seamless scaling to such frameworks in a way that doesn't require changes to the application. A good example is our Mule-ESB support. Mule users can take existing Mule 2.0 applications and significantly improve performance and scalability by plugging in the GigaSpaces runtime into the Mule Framework. The good news is that the wiring happens outside of your application code at a couple of levels:

  1. Connector level – leveraging our messaging layer as the transport for Mule
  2. Clustering level – taking advantage of our clustering capabilities, enabling the internal Mule data structure to span across multiple machines for scalability and high-availability

This integration is provided as part of our open source framework, OpenSpaces. We hope and anticipate that these integrations will be used as a reference by other frameworks looking for ways to provide similar levels of scalability and reliability. A good example of this already happening is the work David Greco did integrating the Camel open source ESB with GigaSpaces.

With 6.6 we also added out-of-the-box integration with Jetty. This was done in collaboration with the Webtide team (the company behind Jetty), who have given us excellent support throughout the process. What I like about this integration is that it enables taking an EXISTING web application packaged as a WAR and dynamically scaling it across a pool of machines. With this approach, you also get session replication injected into your existing application without touching your code or WAR package. If you're willing to make slight configuration changes, you can get a caching reference injected into your application. There is a new example that shows what it takes to scale an existing web application. The example uses the Spring Pet Clinic application and deploys it on a GigaSpaces cluster. The full example is available here.

Removing the language barrier

For decades, languages have been treated almost as a religion by developers. As an ex-CORBA guy, I know how painful language interoperability is to deal with, and how often it requires compromises on functionality, performance and complexity. At GigaSpaces, we realized that there is no reason that different languages shouldn't be treated simply as different forms of writing business logic. They each bring different value. For example, .Net provides better integration with Windows applications. C++ provides better performance in certain areas and provides low-level APIs for integration with many third-party libraries.

Persistence: 6.5 has some major enhancements for implementing our Persistence-as-a-Service model in .Net through support of nHibernate. This enables .Net applications to integrate with existing databases and have their own database mapping layer with a native .Net API. This comes along with quite extensive performance improvements. You can see some of the results here for .Net, and here for C++.

One of our goals for 6.5 was to bring the same level of scalability and simplicity we provide for Java to .Net and to C++, without compromising performance. Unlike some of the alternatives in the market, we don't just provide remote access to our Java runtime, but provide complete application server capabilities in these two languages -- as well as complete interoperability among all three. Java, C++ and .Net services and clients can run in and share the same process, which removes the network call overhead often incurred when a call crosses language boundaries. An immediate benefit is that you're able to run your C++ and .Net business logic where the data is. Furthermore, you can now leverage our existing SLA-driven deployment to automate the deployment of Java and .Net applications. This means that instead of starting each server process manually, you have a single deployment command that makes sure that your servers are running on the appropriate machines, that your backups are running on different hosts from your primaries, and that if one machine goes down a new instance takes over immediately (or, if no machine is available, as soon as one becomes available) -- all without any human intervention!
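The placement rules in that last sentence can be sketched as follows. This is a simplified illustration, not the GigaSpaces deployment mechanism: the `place` and `fail_host` functions, the round-robin policy and the host names are all assumptions made for the example.

```python
def place(partition_count, hosts):
    """Assign each partition a primary and a backup on different hosts."""
    if len(hosts) < 2:
        raise ValueError("need at least two hosts to separate primary and backup")
    return {p: {"primary": hosts[p % len(hosts)],
                "backup": hosts[(p + 1) % len(hosts)]}
            for p in range(partition_count)}


def fail_host(placement, failed, hosts):
    """Promote backups whose primary was lost, then pick replacement backups."""
    alive = [h for h in hosts if h != failed]
    for p, roles in placement.items():
        if roles["primary"] == failed:
            # The backup takes over immediately as the new primary.
            roles["primary"], roles["backup"] = roles["backup"], None
        elif roles["backup"] == failed:
            roles["backup"] = None
        if roles["backup"] is None:
            candidates = [h for h in alive if h != roles["primary"]]
            # Stays None until a suitable host becomes available.
            roles["backup"] = candidates[p % len(candidates)] if candidates else None
    return placement
```

For example, `fail_host(place(4, ["h1", "h2", "h3"]), "h1", ["h1", "h2", "h3"])` leaves every partition with a primary and a backup on the two surviving hosts, and no partition ever has both roles on the same host.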

Dynamic language support

The Java framework guys realized the need to support dynamic languages as part of Java, making the JVM a common platform for running various languages. GigaSpaces XAP 6.5 leverages this, and provides enhanced support for Groovy, JRuby and JavaScript.

Dynamic language support enables writing business logic in Groovy, JRuby or JavaScript and executing it on the GigaSpaces cluster. One of the common use cases for this capability is to provide an elegant alternative to stored procedures. This means you can write business logic in Groovy, for example, that will be executed directly on the data grid nodes. With this, you can write your own custom data queries and aggregation functions, and execute them where the data is. Beyond the performance benefit that you gain from running the logic collocated with the data, you gain the benefit of using dynamic languages, i.e., you can add new functions on the fly without the need to deal with class-version and class-loading issues and without the need to bring the data down whenever you do that. In this way you can add new functions while the system is running and continues to serve other applications.
This feature leverages the SVF mentioned above. This means that you can choose to run these dynamic procedures synchronously, asynchronously, in parallel, etc. Now isn't that cool?

Click here for code snippets and detailed descriptions of this feature.

Data awareness everywhere
Throughout all of our development efforts, we are making sure data-awareness is maintained across the entire stack. Data-awareness means that a method invocation on the new Service Virtualization Framework can be routed to a particular service instance based on the data associated with that service instance. It also means that when you send a message through our JMS implementation, we will be able to route it to the JMS partition that manages the relevant data. Unlike alternative solutions, this is native to our environment, meaning that there is no need for external integration and complexity to achieve this behavior.
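A minimal sketch of the idea, assuming a simple hash-based policy: a single client-side proxy picks the service instance from a routing key carried by the call. The `Partition` and `RoutingProxy` classes and the `deposit` method are hypothetical names for illustration and do not reflect the actual Service Virtualization Framework API.

```python
import zlib


class Partition:
    """Hypothetical service instance that owns one slice of the data."""

    def __init__(self, index):
        self.index = index
        self.accounts = {}

    def deposit(self, account_id, amount):
        self.accounts[account_id] = self.accounts.get(account_id, 0) + amount
        return self.accounts[account_id]


class RoutingProxy:
    """Single client-side entry point that hides the partitions behind it."""

    def __init__(self, partition_count):
        self.partitions = [Partition(i) for i in range(partition_count)]

    def route(self, routing_key):
        # crc32 is stable across processes, unlike Python's built-in str hash.
        slot = zlib.crc32(routing_key.encode("utf-8")) % len(self.partitions)
        return self.partitions[slot]

    def deposit(self, account_id, amount):
        """Invoke the method on the instance that owns this account's data."""
        return self.route(account_id).deposit(account_id, amount)
```

The caller only sees `proxy.deposit("acct-42", 10)`. Because the same key always maps to the same partition, repeated calls for one account land where that account's state already lives.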

Click here to view a code snippet and detailed description of how routing is handled in the Service Virtualization Framework.

Performance, Performance and more Performance...

Improving performance remains a constant goal for all of our releases. As the product matures, finding places in the product where performance can be further optimized is getting harder, and I therefore am always surprised when one of the developers comes up with some creative idea around performance.

In this release we improved performance on several fronts -- including Java, .Net and C++ -- which involved significant optimizations of object serialization and multi-core scalability. For the latter we are working with Azul, and making it part of our testing environment, as well as other multi-core systems such as Sun Niagara. You can see some of the figures and details here, here (comparison with the previous .Net release) and here (C++).

We conducted detailed comparisons of the latency and throughput of a "classic" transactional application based on the standard JEE model (using JBoss, JMS, Spring and Hibernate) with the same application using GigaSpaces as the messaging and data layer -- and eventually replaced the entire JEE stack with a GigaSpaces + Spring stack. It is important to note that throughout this process, the business logic code remained untouched. The initial results of these tests can be found here:

[Figures: latency and throughput comparison charts]

You can find the details of the code that was used in this test and the migration steps in a new whitepaper that is now available on our site here. Uri Cohen wrote up in his blog (The Space as a Messaging Backbone) some of the interesting findings from this analysis, which showed the difference between end-to-end measurement and point optimization, and why in some cases putting a distributed cache in front of a database is not going to be enough.

What's next:

For obvious reasons I can't expose our entire roadmap just yet. What I can say for sure is that we're going to continue improving the level of seamless and simple scalability provided by our platform.

We view the partnership and integration with other frameworks as strategic, and we're going to continue with that effort. One of the frameworks we are planning on working on is GlassFish.

We already announced our first cloud offering, designed to run on Amazon EC2, and including partnerships and integration with RightScale and Cohesive FT. If I'm not mistaken, this was the first Java application server available in a pay-per-use model, designed to meet the needs of enterprises and ISVs that want to offer their applications on the cloud, including as Software-as-a-Service. We're going to put a lot of effort into making cloud deployments simpler, enabling our customers to use it on their local virtualized environments (private clouds) and on public clouds (Amazon EC2, GoGrid, FlexiScale, AppNexus and others), or even a combination of the two, without changing their applications. We're now working on a new version that enables provisioning a cluster of machines, deploying the application on said cluster and opening up an administrative console for the cluster -- all with a single click. This is already working in an internal beta. We're planning to provide a preview release by next month. With the availability of Windows-based clouds, we will be providing our .Net application platform as a cloud offering as well.

On the API and Standards front, we recently joined the OSGi Alliance, where we expect to play an active role. We are also looking into ways through which we can strengthen our compliance with some of the latest standards on the JEE stack, such as EJB 3.0 and JPA. The challenge is not just basic API mapping, but how to do it in a way that doesn't break our scale-out architecture and doesn’t create complexity. Unfortunately, previous versions of the EJB spec weren't a good fit. EJB 3.0 looks much more promising.

On the .Net front, we're going to continue with our performance optimization project. We're also working on making our .Net offering fit natively within a .Net development environment by providing better development and installation packages that fit better with the .Net spirit. We are also looking into ways to simplify the testing and debugging process. For pure .Net users we will make the .Net version available as a standalone package at a reduced price (details will follow).

On the C++ front, we're going to provide our customers with an open source version of our C++ binding and a complete package that will enable them to compile and build our C++ with their own set of dependencies, libraries and compiler versions and flags. This will also allow using the current C++ framework as a broad integration framework for third-party tools and languages.

There's much more than I could cover in this post. I tried to put together what I thought are the highlights of the release. As it's impractical to cover such a wide spectrum of topics in a single post, we started a process in which different people from our R&D and field engineering teams will post on specific aspects of the product and best-practices for using GigaSpaces in existing web, financial, online gaming and other applications.


Be part of our next release:

As we are now making the decisions of what to include in our 7.0 release, it would be nice to hear your feedback and specific requests for enhancements or new features. You can either send me a direct email or send it to PM at GigaSpaces.com. Alternatively, if you think that you have a good idea that other users might be interested in, you can implement it on our community site – OpenSpaces.org.

The new GigaSpaces XAP 6.5 is available for download here.

June 27, 2008

TSSJS Prague: my take-aways

Once again TSSJS was a well-organized event with lots of interesting content. Hot topics that I took notice of were RIA, new languages, and obviously distributed computing and scalability.

I arrived on Tuesday morning, which gave me a chance to meet John Davies, Ted Neward, Kirk Pepperdine and Holly Cummins. We found a nice spot not too far from Charles Bridge. At some point we started discussing the reasons we're seeing a burst of new languages. The discussion about languages is thought-provoking. Ted Neward (one of my favorite presenters) seems to be spending a lot of his time recently thinking about this topic. He explained his view over dinner (while he was completely jet-lagged!). I'll try to summarize the main points:

  1. One size doesn't fit all - we shouldn't try to force one language to do everything and expect it to be good at it all. The concept of using multiple languages in the same application is actually something we've been practicing for a while by using a combination of HTML, CSS, XML, JavaScript, Java, etc. Each language serves a specific purpose.

  2. Different semantics require different expressions, i.e., different languages. An example that was given was Scala and Erlang and the notion of parallel programming as a first-class citizen in the language (as opposed to a set of libraries and explicit APIs in Java). The argument (brought up by Kirk) is that you can't leverage multi-core platforms without languages that were designed to do so. It reminds me that multi-threaded programming indeed wasn't common until threading became native in the languages. Now you can't think of writing even a simple application without threads. So I think that Ted and Kirk have a valid point.

  3. Usability and productivity - how many lines of code are required to express a certain idea? There are many examples that show how different use cases in Java can be relatively verbose and complex in comparison to the "new" languages.

  4. JVM/CLR makes it easy to introduce new languages as just new views and perspectives running on the same platform. Previously, languages such as Perl and TCL had to be built with an entire stack, typically based on C or C++, and had to be ported to various platforms and operating systems. This approach made the choice of language and language interoperability quite difficult, as the decision to choose one language over another was considered a "catholic marriage". Today, the JVM in Java and the CLR in .Net enable better separation of concerns. They provide a common platform that can easily support multiple languages. This simplifies interoperability of different languages within the same application. A good example is the new support for dynamic languages in Java 6 and in .Net. This makes the language decision simpler, as the impact of this decision on our project is less drastic and less risky than it was before.

While I think that all of the points are valid, I couldn't avoid thinking that we're forgetting past experience. For example, you could easily argue that lines-of-code is only one measurement of productivity. Another measure of productivity is maintenance, i.e., how simple it is to read the code and understand it, transfer it to another programmer, etc. My concern is that if the language becomes too flexible and enables each of us to write our own extension, we're going to find ourselves in a position where the only person who understands the code is the person who wrote it, and even that would hold true only for a certain period of time. Think about C++ templates, macros, operator overloading, multiple inheritance -- a lot of "nice" features that made our code very flexible, but less readable due to the large number of indirections we had to go through to parse a single class.

One of the things I liked when I switched to Java from C++ was the fact that to understand my colleagues' code all I needed was to read their .java files. In most cases I didn't really need documentation and it was fairly simple to parse the code because Java restricted much of the flexibility that I just mentioned. Trying to do the exact same thing with C++ requires parsing of header files, macros and typedefs. Another issue is that introducing multiple languages can be quite complex and a barrier to productivity, due to limited skill sets within a certain project, even if choosing a language is less risky than before.

I think the concurrency argument is only a temporary one. I'd hate to choose a different language just for that, because it's something that I expect to see native in Java. So far we managed to deal with multi-core and parallel programming quite effectively with Java using event-driven architecture (EDA) and master/worker patterns, and abstract a lot of the concurrent programming with things such as Futures, Remoting, etc. Surely having some of the features of Scala or Erlang as part of Java would have made our life simpler, but if I measure the value vs. risk involved, I'm not sure it justifies using it right now in a real project.
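To illustrate the kind of abstraction meant here, the following is a minimal master/worker sketch built on futures. It uses Python's standard `concurrent.futures` module rather than Java, purely for brevity; `process_order` and `run_master` are hypothetical names made up for this illustration and are not part of any GigaSpaces API.

```python
from concurrent.futures import ThreadPoolExecutor


def process_order(order_id: int) -> str:
    """Worker: handle one independent unit of work (hypothetical task)."""
    return f"order-{order_id}:done"


def run_master(order_ids, workers: int = 4):
    """Master: hand each order to the pool, then collect the results.

    Each submit() call returns a Future immediately; result() blocks until
    that particular unit of work has finished. Results come back in
    submission order because we iterate over the futures in that order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_order, oid) for oid in order_ids]
        return [future.result() for future in futures]


if __name__ == "__main__":
    print(run_master(range(1, 6)))
```

The point of the pattern is that the calling code never touches threads or locks directly: it only describes units of work and waits on their futures.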

Don't get me wrong. I'm not saying that there is anything wrong with these languages. What I'm arguing is that we need to be very careful before we choose them and make sure that we're measuring the right value, rather than assuming that any of the above arguments applies to our application without proper analysis. Ted was able to convince me to look further into this topic - so I'm probably going to give Scala a try and get a real feel of it.

The event started on Wednesday morning  with a very good presentation by Stephan Janssen. Stephan is the founder of Parleys.com. He is also the founder of the JavaPolis conference held annually in Belgium. He talked about his experience with a wide range of RIA platforms: DHTML, Adobe Flex/Air, JavaFX, Google Web Toolkit (GWT) and Microsoft Silverlight. He discussed his personal experience in using the various technologies as part of Parleys.com.

The combination of a general overview with real-life examples made the discussion quite interesting and lively. The bottom line of this part of the talk was to use Adobe Flex if you're building a site in the short term, and JavaFX if you're planning on launching your site in about a year's time, due to the maturity cycle and the gaps between the two technologies. Personally I found the fact that there are so many options to do the same thing quite confusing. I wish that we could press the fast-forward button on the maturity cycle of these technologies. Having worked with previous versions of Parleys.com, I must say that I was very impressed with the progress and the *right* use of technologies to build the new version of the site.

Another interesting and quite innovative idea that Stephan presented was about hosting services and collaboration with academic partners. The hosting service will enable companies like ours to host their live presentations on the Parleys.com site. In addition, you can embed the presentation in a blog entry. You can also record your talk online, using a web-based application. The partnership with academic institutions enables scaling not just the content, but also the bandwidth, similar to the way downloads work. IMO Parleys.com could easily become the YouTube of online presentations. If you missed the presentation I'd recommend watching this interview here:

My own presentation, Getting ready for the cloud, seems to have been well-received, although I had some concern that it might be too high-level for some of the audience. You can read some of the comments posted by others here and here.

The presentation included a live demo of a web 2.0 application (displaying market data) running live on Amazon EC2. Although the demo ran over a wireless line, it went surprisingly smoothly and I was able to easily redeploy and relocate instances through a simple drag & drop using our UI, which was hosted on one of the EC2 machines. The following day, Uri Cohen gave a session in which he showed the details of what's going on behind the scenes and reviewed the actual code and API used in the demo. If you're interested in experiencing it yourself, you can try out the same demo on our new EC2 version.

TSSJS was a good opportunity to meet in person the winners of our OpenSpaces developer competition. I heard interesting stories about what drove them to write their projects. The common theme was the technology challenge - they heard about our technology and scaling pattern and wanted to get a feel for themselves of how it works. You can listen to some of the stories in recent podcasts we published here, here and here.

BTW, Jason Carreira, one of the winners, has since worked on another project: a scalable Twitter-like application using GigaSpaces and EC2 (an alpha version already exists, he is now looking for hosting opportunities). And Leonardo Goncalves -- the first prize winner -- is already thinking of the next version of his project. The third winner -- Kirill Ishanov -- is also planning to participate in next year's contest. At the end of the first day we showed a video of some of the judges (John Davies and Julian Browne were missing from the video). It’s a light-hearted video in which the judges also make fun of Joe Ottinger :)

   
Two of the talks I very much enjoyed were given by John Davies, formerly the founder and CTO of C24 (which was sold to IONA), who has recently started a new venture called Incept5. I've worked with John for many years now and we often have excellent chats about ideas in our respective markets. John's first talk was one I'd heard before, but as always, he updated it with new anecdotes and ideas. He talked about extreme enterprise architectures, specifically ESBs and grid in the low-latency, high-volume, complex environments of investment banking. John started by explaining the value of a millisecond to the high-end institutions, literally in terms of dollars, something like $100 million per ms. He went on to talk about compiled languages compared to Java for this sort of processing. It was interesting to see John walk through a very high performance matching and reconciliation engine we had designed together for a client a few years ago, and it's exciting to hear that his new company will specialize in this area. John talked about some of the clever coding patterns that had to be implemented to provide linear scalability, and although master-worker was the pattern of choice for scaling, it wasn't as simple as just writing lots of workers.

John's second talk was new to me, and although we discussed the ideas in it in the past, it was fascinating to hear them presented. The room was packed -- standing room only -- as it was a topic near and dear to the hearts of many developers and architects: "The Enterprise without a Database". I thought this would just be an extension of caching, but John went on to emphasize the huge amounts of time and energy (human) being lost on Object-Relational Mapping (ORM). Why do we still persist our well-established Object-Oriented models into a relational database? While ORM is simple at the example level, it doesn't scale given the levels of complexity in today's messaging standards. John made this very clear by example. I got the feeling that he was holding back the solution, perhaps to be released by his new company, but it was clear that there are alternatives to ORM: from caching objects to using CLOBs in classic databases. This is obviously an area to watch, as John always has a good vision for this sort of thing.

At the end of the day I had the chance to have a beer with different people in a nice Mexican restaurant in Prague (courtesy of Jodie, the cameraman, and his local friend). After a few beers, mojitos and lots of peanuts (courtesy of John Davies :)), the topic of open source software (OSS) came up. I think that we all agreed that being open is a no-brainer, and that's the way software products should be built. Being open doesn't necessarily mean free - take Jive and Atlassian, for example. They sell commercial products, but they provide customers with the source code.

Another model is the dual-license model such as that used by Red Hat and MySQL. It's sometimes referred to as the Fedora model. It means that you have a choice to use a free version but if you do, you're on your own. If you choose to use the supported version, you're going to be charged a subscription, for which you'll get extra features and better packaging/documentation.

I argued that it is important to have a solid business model behind a product/project. It should be as important to the users as to the company developing the product. If a product doesn't have a solid business model, one of two things might happen: the project/product is going to be abandoned at some point due to lack of funding, or the owner of the product will change the licensing model to monetize the IP and established user base. We've seen both scenarios happening already.

I also argued that the Fedora model is usually successful only as part of a commoditization strategy. For example, JBoss's strategy was to go after the lower end of WebLogic/WebSphere accounts. The same applies to MySQL. This strategy seems to only work to a certain limit. I argued that this model is not proven in an emerging product category, where large investments in market education and innovation are required to achieve massive and sustainable adoption. In such cases, the Jive/Confluence model seems to be a better fit. Anyway, this topic is worth a separate discussion, so I'll leave it at that for now.

Unfortunately I had to leave on Friday (to be at my daughter's end-of-year party at school), so I missed Shay Banon's presentation. Based on what I heard it went very well.  You can view Shay's presentation here.

Anyway, it was a really fun event and I look forward to next year.

June 05, 2008

Economies of Non-Scale

Scalability forces us to think differently. What worked on a small scale doesn't always work on a large scale  -- and costs are no different.

To measure the cost impact of scaling, let's look at the amount of resources required to scale to a given level. We'll use Amdahl's Law as a method to measure the number of required CPUs. This will provide us with a proxy for hardware and software costs. Later on we'll also review other costs related to the process of scaling. What's going to come up clearly in this analysis is that the cost of scalability grows exponentially in non-linearly scalable applications.

I'll also argue that scaling is not just a technical issue: it has a direct impact on our business and its ability to effectively compete.

Measuring the Impact of Scalability on Hardware and Software Costs

The following analysis is based on Amdahl's Law, which is described well in this Wikipedia entry. Put briefly, the negative impact of the non-scalable portion of our application grows exponentially relative to the scaling requirement, up to the point at which adding resources will not improve performance/throughput, as seen in this diagram:

 

[Figure: Amdahl's Law]

By non-scalable I mean the part that is serial (non-parallelizable) -- or to use  terminology that is more relevant: the level of contention in the application. Contention can be thought of as the percentage of time operations wait on things such as shared table-locks in the database, persistent queues or distributed transactions.

The diagram above shows that if 90% of our application is free of contention, and only 10% is spent on shared resources, we will need to grow our compute resources by a factor of roughly 100 to get close to scaling by a factor of 10! Another important thing to note is that 10x, in this case, is the limit of our ability to scale, no matter how many more resources are added.
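For readers who prefer to check the numbers rather than read them off a chart, here is a minimal Python sketch of the standard Amdahl's Law formula, speedup = 1 / (s + (1 - s) / N), where s is the contended (serial) fraction and N is the number of servers. The function names are illustrative only.

```python
def amdahl_speedup(contention: float, servers: float) -> float:
    """Speedup over a single server under Amdahl's Law.

    contention: fraction of the work that is serial, between 0.0 and 1.0.
    servers:    number of servers (or CPUs) sharing the parallel part.
    """
    return 1.0 / (contention + (1.0 - contention) / servers)


def max_speedup(contention: float) -> float:
    """Upper bound on speedup as the number of servers grows without limit."""
    return 1.0 / contention


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} servers at 10% contention -> {amdahl_speedup(0.10, n):.2f}x")
    print(f"limit at 10% contention -> {max_speedup(0.10):.1f}x")
```

With 10% contention the formula gives roughly 5.3x for 10 servers, 9.2x for 100 and 9.9x for 1,000: each tenfold increase in hardware buys less, and the 10x ceiling is approached but never crossed.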

Now let's see what it will cost us -- in terms of CPUs and software licenses -- to increase our scalability by a factor of 10, assuming that we have only 10% contention (and that's a fairly optimistic scenario with prevalent tier-based architectures).

 

[Figure: Cost of scaling]

Take-Aways

There are two key take-aways from this analysis:   

  1. The cost of non-linearly scalable applications grows exponentially with the demand for more scale.
  2. Non-linearly scalable applications have an absolute limit of scalability. According to Amdahl's Law, with 10% contention, the maximum scaling limit is 10. With 40% contention, our maximum scaling limit is 2.5, no matter how many hardware resources we throw at the problem.
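Both take-aways can be checked by inverting Amdahl's Law to ask how many servers a given scaling target needs. The sketch below is a hypothetical helper (the name `servers_needed` is made up here) that returns `None` when the target sits at or beyond the limit of 1 / contention.

```python
import math


def servers_needed(contention: float, target_speedup: float):
    """Smallest whole number of servers that reaches target_speedup.

    Solves 1 / (s + (1 - s) / N) >= target for N.
    Returns None when the target is at or above the 1 / s ceiling,
    because no finite number of servers can reach it.
    """
    if contention > 0 and target_speedup >= 1.0 / contention:
        return None
    exact = (1.0 - contention) / (1.0 / target_speedup - contention)
    # Subtract a tiny tolerance so floating-point noise does not round up
    # a value that is mathematically a whole number.
    return math.ceil(exact - 1e-9)


if __name__ == "__main__":
    for target in (2, 5, 8, 9, 9.5, 10):
        print(f"{target}x at 10% contention -> {servers_needed(0.10, target)} servers")
    print(f"2.5x at 40% contention -> {servers_needed(0.40, 2.5)} servers")
```

At 10% contention, a 5x target needs 9 servers while a 9x target already needs 81, which is the steep cost curve described in the first take-away; the 10x target (and 2.5x at 40% contention) comes back as `None`, which is the hard limit described in the second.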

An Example

The following is inspired by numerous true stories I have seen at GigaSpaces customers.

A team is tasked with building an online order management application that needs to process 1,000 orders per second. They choose a typical n-tier architecture with a web server as the front-end and a database as the data-tier. Note: In this case I give the web-tier as the front-end. In some systems the front-end is actually a messaging system. While the implementation is rather different, the behavior for the purpose of this discussion is the same, as both represent different forms of feeds or service requests.

Let's assume that the team designed the architecture by the book to ensure 100% reliability and consistency. This means that every critical transaction is stored in the database.

They now find that a single web server can handle only 200 transactions/sec, so they decide to put a load-balancer on the front-end, and deploy 5 web servers to meet the 1,000 orders/sec requirement. At this point they realize that despite the fact that they increased the number of servers by a factor of 5, the total number of orders/sec didn't really increase by much -- and only reached about 400 orders/second.

They start to monitor the system, and find that the application spent 40% of its time on shared locks in the database. As we already saw above, with 40% contention, we can only increase the throughput by a factor of 2.5 -- or 500 orders/second, so no matter how many web servers are added, the application will never be able to meet its throughput goal.

So the team decides to reduce the contention by placing a distributed cache in front of the database, which will reduce the hits on the database. They are cost-conscious so they select a free product -- memcached -- as the caching solution. As memcached cannot serve as the system of record (it doesn't support transactions, queries, is not highly-available, etc.), adding it reduces the contention,  but does not eliminate it.

For this example, let's take an extremely optimistic scenario and assume that memcached reduces the contention from 40% to 10%. According to Amdahl's Law, increasing throughput by a factor of 5 with 10% contention will require about 10 servers. Now the team is happy!

Just as they get the system to work as expected, the boss knocks on the door and says: "Hey,  we're going to launch a major campaign for the holiday season, and marketing anticipates double the number of visitors on our site. If we're very successful, we might even triple the normal traffic. Can we support this?".

The team is in initial shock. "Double? Triple? Are you kidding me!? We just worked our ass off to get this application to work for the existing load, and now he's talking about doubling and tripling the load as if it were trivial?" The team now faces the prospect of explaining to management that they don't know how to achieve this capacity, or even worse, that they are incapable of it (while their competitor already achieved it last year). Let's see: double throughput means 2,000 orders/sec. So they will need to increase single-server throughput by a factor of 10. According to Amdahl's Law they will need to grow from 10 servers to roughly 100 servers! Tripling the capacity is not even an option. The system already reaches its maximum limit when it grows by a factor of 10.
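The story's round numbers can be compared against the formula directly. This is a small Python sketch that plugs the three configurations from the example into Amdahl's Law; the only inputs are the figures already given (200 orders/sec per server, 40% and 10% contention), and the function name is illustrative.

```python
SINGLE_SERVER_RATE = 200  # orders/sec handled by one web server in the story


def throughput(contention: float, servers: int) -> float:
    """Total orders/sec for a cluster, per Amdahl's Law."""
    return SINGLE_SERVER_RATE / (contention + (1.0 - contention) / servers)


scenarios = [
    ("5 servers, 40% contention (no cache)", 0.40, 5),
    ("10 servers, 10% contention (with cache)", 0.10, 10),
    ("100 servers, 10% contention", 0.10, 100),
]

for label, contention, servers in scenarios:
    print(f"{label}: {throughput(contention, servers):.0f} orders/sec")
```

The formula gives about 385, 1,053 and 1,835 orders/sec respectively, so the figures in the story (about 400, 1,000 and 2,000) are close approximations. Note that 2,000 orders/sec is exactly the ceiling at 10% contention, so 100 servers only gets near the doubled target and anything beyond it is out of reach.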

And there's one more thing the team needs to tell the boss...

Ahem, regarding the budget...

For the sake of discussion, let's assume that the team was using free products to reduce software license costs. To meet their initial goal of 1,000 orders/second they had to use 10 machines for web servers and another one for the database. If each machine costs $10,000, total costs are $100k for the web-tier and another $10k for the data-tier, so we're talking $110k total cost of hardware.

In addition, the team spent a substantial amount of time on iterations to analyze the bottleneck and find a solution, wire the pieces together, make sure everything works and modify code. Let's say this development effort took 5 team members 3 months to complete, so it cost about $150k (again, I'm being optimistic). Total costs are at $260K.

As we've already seen, to double capacity, they will need 100 servers. Let's also assume that it will require a similar effort to develop, tune and test it  -- so $900k in additional hardware and another $150K for development. Total costs are now at $1.31 million.
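The running totals above are simple enough to reproduce in a few lines of Python. The constants are the figures from the example ($10,000 per machine, $150k per development iteration); the variable and function names are just for illustration.

```python
MACHINE_COST = 10_000             # dollars per machine, from the example
DEV_COST_PER_ITERATION = 150_000  # 5 team members for 3 months


def hardware_cost(web_servers: int, db_servers: int = 1) -> int:
    """Hardware cost of the web tier plus the data tier."""
    return (web_servers + db_servers) * MACHINE_COST


# Initial goal: 10 web servers, 1 database machine, one development effort.
initial_total = hardware_cost(10) + DEV_COST_PER_ITERATION

# Doubling: 90 additional web servers plus a second development effort.
additional_hardware = (100 - 10) * MACHINE_COST
doubled_total = initial_total + additional_hardware + DEV_COST_PER_ITERATION

print(f"initial: ${initial_total:,}")
print(f"doubled: ${doubled_total:,}")
```

This yields $260,000 for the initial system and $1,310,000 after doubling, so a 2x increase in capacity costs roughly 5x the original budget.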

Now off to triple the load... wait, we can't. The team now literally has to go back to the drawing board (a whiteboard in this case) and completely re-design and re-write the application -- if they can even figure out how to do it. Can we even measure the cost of such a scenario? We can probably measure the development cost, but what is much harder to measure is the amount of money the company is going to lose because it can't support the tripling of load.

It would be a false assumption that the company controls the situation, and has time to plan everything in advance. What if Miley Cyrus shows up on the cover of Seventeen holding the super-cool gizmo the company sells and now all of a sudden there's a mad rush to buy it on the site? We need to assume that we can't predict the load. It doesn't happen when we plan for it. It can happen when we least expect it.

With social networks or electronic trading and e-commerce, such events can quickly lead to a viral effect or what's also known as an 'event storm'. Such a chain of events can quickly lead to disaster, such as a site crash, and the company is now all over the news and the blogosphere (and not in a good way), customers are frustrated and many defect to the competition. The trouble is that they can't fix it in a snap. Remember: they need a few months heads up to re-write the application.

Lessons Learned

There are a few things we can learn from this story:

  • direct costs of scaling grow exponentially with the demand for scale;
  • the cost of software licenses can be a very small factor in total cost;
  • indirect costs, resulting from the unpredictability of our system and the inflexibility that it imposes on our business, can have huge implications to the business, well beyond any measurable criteria. Even if your company can measure the direct losses from downtime due to lost sales or trades, you'll be hard-pressed to measure damaged reputation, loss of customer trust, and in some cases, the loss of your job!

A Better Approach: High-Throughput and Linear Scalability

Things don't have to be the way I described above -- and for a growing number of companies -- they aren't. In the blog post Twitter as a scalability case study, I detailed the principles that companies such as Google and Amazon follow, as well as those who use GigaSpaces, so I won't repeat them here.

Following those principles, we use a memory-centric solution that addresses architecture holistically (end-to-end) and is linearly scalable. Because it is in memory, it performs much better. GigaSpaces customers have seen throughput improvements of between 10x and 100x, depending on the existing architecture and the scenario. Let's say -- again, being very conservative -- that throughput with this approach is only 5x: 1,000 orders/sec per server. This would mean that to meet the goal of 1,000 orders/sec we only need 1 machine (compared to the 5 originally planned, and the 10 the team ended up needing). Even if the software cost of this solution is $20k/CPU, thanks to the reduction in hardware costs, we will end up with significant cost savings.

Perhaps more importantly, this approach follows a shared-nothing architecture (meaning each node is entirely self-sufficient) and is linearly scalable. As such, scaling simply requires adding more servers as needed, without code changes or complex provisioning. Moreover, if we know the throughput of a single server, we know exactly how many servers we will need to achieve future or unanticipated requirements. All we need to do is multiply the number of servers by the throughput per server. Remember, because it is linearly scalable, it does not suffer from diminishing returns, and there is no 'scalability wall'. So if one machine can process 1k orders/sec, two machines can process 2k orders/sec and N machines can process Nk orders/sec.

In the case of our team above, if they had taken this approach, they would have only needed two servers to double the throughput (with a total solution cost of $220k - compared to $1.31 million!) and three servers to triple it (with a total solution cost of $250K - compared to who knows what). And if they had faced any unanticipated peaks (thanks to Miley), they could have quickly scaled for that as well.
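The post does not spell out how the $220k and $250k totals are built. One breakdown that reproduces them, sketched below in Python, assumes one $20k software license per server (i.e. one licensed CPU per machine), the same single $10k database machine, and the same $150k development effort as before. These assumptions are a reconstruction for illustration, not a stated pricing model.

```python
import math


def linear_servers(target_rate: int, per_server_rate: int) -> int:
    """Servers needed when throughput scales linearly with server count."""
    return math.ceil(target_rate / per_server_rate)


def linear_solution_cost(
    servers: int,
    machine: int = 10_000,             # hardware per machine, from the example
    license_per_server: int = 20_000,  # assumption: one $20k/CPU license each
    db_machines: int = 1,              # assumption: database machine is kept
    development: int = 150_000,        # assumption: same effort as before
) -> int:
    """Total cost under the assumptions listed above."""
    return servers * (machine + license_per_server) + db_machines * machine + development


for target in (2_000, 3_000):
    n = linear_servers(target, per_server_rate=1_000)
    print(f"{target} orders/sec -> {n} servers, ${linear_solution_cost(n):,}")
```

Under these assumptions the totals come out at $220,000 for two servers and $250,000 for three, matching the figures quoted above. Each extra 1,000 orders/sec adds a fixed $30,000, which is what linear scalability means in budget terms.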

What's interesting about linear scalability is that even if we assume no throughput increase at all PER SERVER, the cost savings are still remarkable. In our case, let's assume that the linearly scalable solution still has a throughput of 200 orders/sec per single server. In order to achieve the 2,000 orders/sec throughput, we will just need 10 servers, compared to 100.

The example above illustrates that what initially appeared to be a low-cost solution (free middleware), became extremely expensive as scalability requirements grew.

Measuring the Cost of Scalability: A Cheat Sheet

Following is a short summary of what we should consider when measuring the cost of system scalability (or lack thereof):

  • Cost of hardware and software, as a function of:
    • How many CPUs (or machines) are required to achieve the desired throughput/concurrent users/latency, given the:
      • Throughput per single server
      • Level of contention, and therefore the number of servers needed to scale as prescribed by Amdahl's Law. This calculation needs to be performed given different scale requirements (2x, 3x, 5x, 10x, etc.), as it will grow exponentially
    • If we cannot achieve on-demand scalability -- we also need to consider the cost of hardware and software required for over-provisioning to ensure we can handle peak loads
  • Cost of development, QA and testing
    • Initial design and development cost, including learning curve and integration
    • Cost of re-designs and re-writes for when we need to scale our application
  • Cost of failure – what is the cost of downtime due to under-provisioning or inability to scale on-demand? This should consider direct revenue loss, compensation payments, loss of productivity, damaged reputation, lost future income and so on.
  • Provisioning, deployment and operations costs - including management and monitoring. The more complex the system is, the more difficult it is to identify bottlenecks and root causes.

A Final Note on Comparing Solutions:

In the context of scalability and ROI, when we evaluate competing solutions, we need to make sure that we are comparing apples to apples.

  1. Comparing Apples-to-Apples: It is not enough to measure the license cost. We need to normalize it with the performance and scale factors. For example, if two products cost the same in terms of cost/CPU, but one performs 5 times better, then the cost of that product is, in fact, one fifth of the other.
  2. Total Cost of Ownership (TCO): We cannot look at the software license costs alone. We need to assess the overall cost of the system, including hardware (and other platform costs, such as OS), additional components required (such as load balancers and storage), cost of development and cost of failure. In the final analysis, free products that are not linearly scalable are going to cost much more than commercial products that are.
       
  3. End-to-End Measurement: When it comes to scalability, you are only as strong as your weakest link, so assessing the cost of scaling requires a holistic measurement. Before we compare two products we need to understand how they each play a role in achieving end-to-end scalability. Linear scalability requires an end-to-end solution. Solutions that are built from a bundle of tiers and products are likely to be non-linearly scalable, as contention is created by the integration of the tiers and products, the need to ensure consistency, different clustering models and other issues. This means that before we can even measure or compare cost, we first need to compare what it takes to reach linear scalability with each product. It might be that on a simple caching or messaging level, two products provide the same level of scalability. When, however, we need to integrate the messaging system, use a transaction manager, a database or filesystem to ensure reliability, our end-to-end scalability is going to be significantly limited.
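The normalization described in the first point is a one-line calculation. The Python sketch below uses two hypothetical products, A and B, with made-up prices and throughput figures chosen only to mirror the "same cost/CPU, 5 times the performance" example.

```python
def cost_per_unit_throughput(license_cost_per_cpu: float, throughput_per_cpu: float) -> float:
    """License dollars paid for each order/sec of capacity."""
    return license_cost_per_cpu / throughput_per_cpu


# Hypothetical products: same license price, different throughput per CPU.
product_a = cost_per_unit_throughput(license_cost_per_cpu=20_000, throughput_per_cpu=200)
product_b = cost_per_unit_throughput(license_cost_per_cpu=20_000, throughput_per_cpu=1_000)

print(f"Product A: ${product_a:.0f} per order/sec")
print(f"Product B: ${product_b:.0f} per order/sec")
print(f"A costs {product_a / product_b:.0f}x more per unit of throughput")
```

The same normalization only tells part of the story: it covers license cost alone, so the hardware, development and failure costs from the second point still have to be added before two solutions can be compared.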

This post was co-authored with Geva Perry.

April 30, 2008

Cool Projects on OpenSpaces.org

The OpenSpaces.org community site launched in January. I was surprised by the rapid adoption of OpenSpaces since then, with lots of interesting innovations on things I didn't even think of. I'm sure that some of the projects will be very useful to many OpenSpaces users. This shows the value behind an ecosystem and community. Given the right tools, people will start collaborating and sharing things that otherwise would be buried on their hard disks, or in their minds.

The OpenSpaces.org site also provides a great tool for GigaSpaces partners and individuals in the general developer community to show their skills by publishing valuable content. A good example is GridDynamics, a GigaSpaces partner, which invested time and effort in producing high-quality, well-documented projects.

The same goes for various people on the GigaSpaces team who came up with great ideas based on work that they did with customers. They use the OpenSpaces.org platform to share the tools they developed with other users in the community who might have similar needs. For example, the OpenSpaces demos project shows how to integrate Ajax, Spring MVC and OpenSpaces to scale a typical web application (a market data front end, in this specific case).

Another good example is TGris, an extension of the testing grid framework that we use internally at GigaSpaces, and which several customers showed interest in for automating the testing of their own applications (note that the tool is not specific to OpenSpaces).

Another interesting class of projects integrates OpenSpaces with various frameworks and APIs. These projects simplify the integration and adoption process, and shorten time-to-value. Good examples are the projects that integrate OpenSpaces/GigaSpaces with Amazon SimpleDB, JPA, and Memcached, as well as the Cache Integration project, which enables OpenSpaces/GigaSpaces support for many frameworks, such as Acegi Security, Cocoon, Jetty, iBatis, OpenJPA, Velocity and others.

Other people built entire functional applications, such as Leonardo Gocalves's GoDo - Goods Donation System (see details below), and Jim Liddle's MobileGSFeed, which provides a scalable solution for handling Atom feeds through the iPhone. Jim actually runs our sales in the UK & Ireland. Never in my dreams did I imagine that OpenSpaces.org would be used by sales guys :-)

Anyway, I'm very pleased to let you know that OpenSpaces reached an important milestone two weeks ago, when the deadline of the developer contest passed. Fourteen candidates made it to the final stages. Only three will be finalists. A distinguished panel of judges interviewed each contestant. The judges are Adrian Colyer, CTO, SpringSource; Joe Ottinger, Editor, TheServerSide.com; John Davies; Julian Brown, Architecture Consultant, RWE; Keerat Sharma, Platform Engineer, Gallup; and Ross Mason, Co-founder and CTO, MuleSource.

All of the candidates put up a really good fight and made it very hard for the judges to reach their final decision. The winners of the contest will be announced at a nice venue in Prague during TheServerSide Java Symposium event. Stay tuned for updates on the exact date and venue here and on The GigaSpaces Blog and web site. We also intend to interview each of the finalist project owners and publish the interviews on a blog.

Here are some of the interesting projects (in alphabetical order). The full list of projects can be found here.

Please join one of the projects or start a new one yourself. If you already developed something, but are concerned about the time it will take to initiate a new project -- don't be! It is extremely easy and quick to start a new project and if you need any help, we're ready to support you.


December 17, 2007

Who needs standards anyway?

There is an interesting debate taking place on InfoQ: What role will the JCP play in Java's future?

"Alex Blewitt described the Java Community Process (JCP) as dead, likening it to a headless chicken which "doesn't realise it yet and it's still running around, but it's already dead". This touched off a debate over the usefulness of the JCP and how much it will play a role in Java's future"

The traditional role of standards was to define a spec that everyone would comply with, so that an open market would emerge around the standards. This, presumably, would enable customers to choose the best implementation and avoid vendor lock-in.

A quick look at current trends reveals that technologies like PHP and the Spring Framework, neither of which is a formal standard, are being adopted more rapidly than many of the existing formal standards. These frameworks are able to respond quickly to market needs, while traditional formal standards lag behind.

Perhaps more importantly, even in projects that use a formal standard, such as JEE, there is typically heavy use of non-standard, proprietary components that come with the products that implement the standard. So even in these cases, the fact that only a part -- sometimes a small part -- of the project complies with a standard significantly reduces the value of the standard. This is especially true in the Java world.

Alex Blewitt focuses on the reasons why the JCP isn't successful, but his arguments hold for other software standards bodies as well. IMO there is a more fundamental problem that goes beyond the JCP or any other process. At the heart of it is the fact that the primary focus of such standards bodies is on agreeing on a specific API or protocol. This focus is too low level and is often tied to specific semantics that don't leave enough room for innovation and optimization, and therefore it doesn't encourage healthy competition on the implementation of the standard. It also makes it very hard to find common ground for agreement and often leads to ugly compromises and tediously protracted processes. These in turn lead to dissatisfaction among users, and consequently, to lack of adoption -- and the vicious cycle continues.

Right now we are experiencing a Catch-22 scenario in which we want to have an open market and avoid vendor lock-in, but the process to achieve that openness via standards is somewhat closed -- and basically broken. It is this realization that brought me to ask the question: who needs standards, anyway?

Keeping an open market is an important goal that we should continue to strive for. What we need to do is change our thinking and break away from the traditional approach to the role of standards as we know them today.

Here are some ideas to help address the issues:

Leverage open source communities and product adoption as a way for defining de-facto standards

We don't need any regulatory process to determine which project qualifies as a standard. Anyone who thinks that they have a good idea can just spawn their own project and do pretty much whatever they like with it. Adoption, or lack thereof, will be the measure of the project's relevance. In other words, if you're doing the right things and solving real problems, then most likely it will be reflected in higher adoption. Adoption is an excellent tool for measuring success. To make this tool even more effective, you can also apply measurements of adoption, similar to Google PageRank. The rank could be based on the number of downloads, the number of references in the blogosphere, the quality of those references, etc. One of the implicit benefits of this model is that it brings the power back to the developer. It also creates a highly competitive environment that encourages innovation and alternative thinking.
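
As a rough illustration of what such an adoption measurement could look like, here is a minimal Java sketch. It is a plain weighted score, not PageRank itself (which works on a link graph), and the signals, the weights, the log scaling and the sample projects are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: the signals, weights and formula are illustrative assumptions.
class AdoptionRank {

    static final class Project {
        final String name;
        final long downloads;
        final long blogReferences;
        final double referenceQuality; // assumed to be a value between 0 and 1

        Project(String name, long downloads, long blogReferences, double referenceQuality) {
            this.name = name;
            this.downloads = downloads;
            this.blogReferences = blogReferences;
            this.referenceQuality = referenceQuality;
        }
    }

    // Log scaling keeps one huge download count from drowning out everything else;
    // references are weighted by their quality and count double (arbitrary choice).
    static double score(Project p) {
        double downloadSignal = Math.log10(1 + p.downloads);
        double referenceSignal = Math.log10(1 + p.blogReferences) * p.referenceQuality;
        return downloadSignal + 2.0 * referenceSignal;
    }

    // Project names ordered from highest to lowest score.
    static List<String> rank(List<Project> projects) {
        List<Project> sorted = new ArrayList<>(projects);
        Comparator<Project> byScore = Comparator.comparingDouble(AdoptionRank::score);
        sorted.sort(byScore.reversed());
        List<String> names = new ArrayList<>();
        for (Project p : sorted) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Project> projects = Arrays.asList(
                new Project("A", 10_000, 10, 0.5),
                new Project("B", 1_000, 200, 0.9),
                new Project("C", 50, 2, 0.3));
        System.out.println(rank(projects));
    }
}
```

In this made-up data set, project B outranks project A despite having a tenth of the downloads, because it is referenced far more often and by better sources. Which signals to use and how to weight them is exactly the kind of decision a community would have to argue about.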

Keep It Loose

Leverage new methodologies, such as the declarative abstraction model (dependency injection), annotations and Aspect-Oriented Programming (AOP) to create a loose plug-in model. With this approach, we can plug in different implementations that don't necessarily comply with the same API -- or even the same technology. Mule is a very good example of this approach.

With Mule you can plug in different protocol implementations, as well as different event sources -- such as JMS, Web Services, JavaSpaces and others -- and inject the event into your business logic from any of those sources in the same way. The key is that you can achieve this without changing your business logic every time you add a new data source, and without forcing the use of a specific event-handling API, such as JMS. Spring Remoting does pretty much the same thing. We can write our service as a POJO and then use the remoting abstraction to invoke that service through a local or remote proxy, with the proxies created dynamically, without forcing the implementation to bind to a specific protocol (IIOP, JRMP) or model.
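
The idea can be sketched in a few lines of plain Java. This is not the Mule or Spring Remoting API; the EventSource interface, the two adapters and the OrderLogic class are hypothetical names invented for this illustration. The point is only that the business logic never sees a transport-specific API, and adding a source means adding an adapter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these types belong to Mule, Spring or JMS.
class LoosePlugins {

    // Wires any number of sources to the same business logic.
    static List<String> run(List<EventSource> sources) {
        OrderLogic logic = new OrderLogic();
        for (EventSource source : sources) {
            source.deliverTo(logic::onEvent);
        }
        return logic.processed;
    }

    public static void main(String[] args) {
        List<EventSource> sources = Arrays.asList(
                new QueueSource(Arrays.asList("order-1")),
                new RequestSource(Arrays.asList("body=order-2")));
        System.out.println(run(sources));
    }
}

// Business logic as a plain object: no messaging or remoting API appears here.
class OrderLogic {
    final List<String> processed = new ArrayList<>();

    void onEvent(String payload) {
        processed.add("processed:" + payload);
    }
}

// The only contract a plug-in must honor: hand payloads to a handler.
interface EventSource {
    void deliverTo(Consumer<String> handler);
}

// Stand-in for a message-queue style transport.
class QueueSource implements EventSource {
    private final List<String> messages;

    QueueSource(List<String> messages) {
        this.messages = messages;
    }

    public void deliverTo(Consumer<String> handler) {
        messages.forEach(handler);
    }
}

// Stand-in for a request style transport that wraps payloads in an envelope.
class RequestSource implements EventSource {
    private final List<String> envelopes;

    RequestSource(List<String> envelopes) {
        this.envelopes = envelopes;
    }

    public void deliverTo(Consumer<String> handler) {
        for (String envelope : envelopes) {
            // Strip the transport-specific "body=" prefix before handing over.
            handler.accept(envelope.substring("body=".length()));
        }
    }
}
```

Running the example prints [processed:order-1, processed:order-2]. Each adapter hides its own format and delivery style, so OrderLogic stays unchanged when a third source is added.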

Commercial vendors can be involved in this process by contributing back to these open source projects, and through that influence their evolution. A good example of this is MySQL, which is supported by big companies such as SAP, and by companies like Google and eBay, which contributed large portions of code when they found things that were insufficient to meet their needs. They are not making these contributions for philanthropic reasons. Their interest in doing so is quite straightforward: if it's part of an open source project, it's going to be maintained and tested by an entire community of users, and become a de facto standard.

Now don't get me wrong: this model is far from perfect. One of the biggest challenges is that we will end up with too many options to choose from and too many frameworks that don't work together.

However, if we consider the LAMP stack, and how it evolved without a big brother that controls the process, there is reason to be optimistic that this model could work, and is the best one we have.
