Two weeks ago I gave a presentation at the Sun HPCW workshop in beautiful Regensburg, Germany. It was a great chance to meet many of the folks from the Sun Grid Engine team face to face. It was also a great chance to take a step back and look at the evolution of one of the interesting areas in the industry - analytics. Even though this is one of the more technically challenging areas, I find it odd that relatively little has been discussed about this within the developers community.
In this post I will try to provide an overview of the different approaches for running analytic applications. I will also examine how patterns like Map/Reduce and a new environment such as the Cloud, drive a new approach that is significantly simpler and more cost effective than many of the existing approaches. There are other models for analytics such as stream based processing or CEP that i chose not to cover in this post as they deserve represent a new paradigm that deserves a separate discussion.
What is Real-Time Analytics (a layman’s definition)?
Here is the simplest and shortest definition I could come up with (probably not the most accurate one, but good still good enough):
· Analytics: The process of manipulating raw data (a.k.a data mining) into a meaningful report.
· Real-Time Analytics: The ability to generate the desired report in near real time.
Analytics in the Finance Industry
It is interesting to note that financial applications within organizations are organized based on the “speed” with which they need to react to the market. The Front-Office for example, would be the trading/market data application that needs to react to the market event in a fraction of a millisecond. The Middle and Back-office is the part of the organization that looks at an aggregated view of the market and tries to draw various trends, based on the current market and history. The algorithm that is used to process that data and generate the required report is referred to as Monte Carlo simulation. A typical set of reports generated in that model would be P&L (Profit & Loss), VAR (Value at Risk), Reconciliation etc.
In previous years, those reports were generated on a post trade basis, i.e. at the end of the trading day, the data from the current day was aggregated into files or a database, and during the night there was a set of services that processed the data and generated the desired reports. That meant that if there was an event during the trading day that could affect the portfolio, it would only be taken into account the next day. At the same time that the volume of data increased, the window of time to process that data (during the night) became shorter due to globalization (time zone differences shortened the “night”). This led to scenarios where simulation accuracy was limited by these margins.
Analytics Becomes Mainstream
In general you can categorize any application that looks at an aggregated view of a stream of data, as an analytic application. Fraud detection would probably be another good example of analytics, since it tries to continuously monitor user actions in order to determine whether their action is part of a fraud or not. Social networks, Ecommerce and even personalization, such as search engine personalization have become more mainstream. With that, the demand to process vast amounts of data to produce various market trends, user behavior, fraud behavior etc. becomes not just useful, but critical to the success of the business. A search engine would be a good example of that – the quality of the search results is a direct function of how well you are able to aggregate content and user search patterns, and apply it to the way you produce the search results.
The Evolution of Analytic Application Architecture
To make things simple, I will start with a very basic view of the analytical application architecture, which represents the two most common architectures:
Phase 1: First-Generation Analytical Applications
Phase 1-1: Excel spreadsheet
Everyone uses Excel to generate various reports, so the use of Excel and spreadsheets are probably the most common and popular way of generating reports. The way this is done is by loading the data from a file, and then using Excel to generate the graphs and various views of that data.
Phase 1-2: High-End ETL process
Excel works well for small data sets, but for large data sets a batch process is used that is based on ETL (Extract Transfer and Load). It represents a batch process in which the operational data from the production sites is normally aggregated into some sort of data warehouse. In the process of doing so, the data is stored in a format that is optimized for generating reports. The process normally involves extracting the data, transferring it into an optimized format, and loading it into the warehouse for further processing. The ETL process normally happens once per data source. The data processing (a.k.a data mining) is where most of the heavy duty work takes place. In many cases, once the relevant reports and analysis have been produced, the data becomes obsolete and at this point it is common to clear it and store it in a long term archive.
The Need for Real-Time Analytics
You don’t need to be a genius to realize that that if all reports could be generated in real time, this would probably be the ideal world. The only reason this is not done, is purely technical. No way had been found to aggregate all this vast amount of data and process it in real time. Traditional databases had to be tuned completely differently, to deal with real time transaction processing and complex analytics (type of locking, data structure, transactional semantics etc.).
“Real-time data … requires a continuous flow of data, and traditional enterprise data warehouse deployments introduce slow data delivery to BI environments” (Forrester)
The challenge became even worse as the volume of data increased exponentially, and at the same time there was a continuous pressure by businesses to process even more complex analysis at an even lower cost. It was clear that addressing those contracting requirements couldn’t be achieved by just “throwing more hardware” into the problem. That led to the next phase of analytic application architecture, i.e. a grid-based analytic system.
Phase 2: Second-Generation Grid-Based Analytical Applications
Phase 2-1: Speeding up Analytics via Parallel Processing
Phase 2 was not yet trying to get into real-time analytics but was mainly trying to find ways to speed up the process of analytics. The majority of the focus in that area was on the batch processing that in some cases could take days or hours. By introducing grid computing, the processing time could be brought down from days to hours, and from hours to minutes.
With grid computing, the way that processing time is speeded up, is simply by spreading the processing into a large pool of commodity hardware. By splitting a linear process into small chunks that can be processed in parallel, the processing time can be speeded up quite significantly.
Phase 2-2: Adding an In-Memory Data Grid
As the time to process the data went down from hours to minutes, other factors that were previously considered insignificant became much more significant. It became clear that to speed up the analytics beyond the current level, closer to real time, it was necessary to consider ways to speed up the load and data access time, because they became more significant relative to the processing time.
“Real time drives database virtualization - Database virtualization will enable real-time business intelligence through a memory grid that permeates an infrastructure at all levels" (Forrester)
This approach became fairly common during the past few years. There is more information about this approach in one of my previous posts: Bringing Data Awareness to the Grid
Challenges with Grid-Based Analytical Systems
While technically the combination of Compute and Data Grid were able to bring the time it takes to perform even very complex analytics to near real-time, the adoption of that solution was still marginal. One of the main reasons that this solution was not widely adopted was cost and complexity. Only a few organizations could afford building and maintaining large data centers. The skill set for running and building applications that could run in those environments was fairly rare, and therefore the labor cost was also rather high.
Phase 3: Next-Generation Public/Private Cloud-Based Analytics
The cloud makes a grid-based system affordable. So with the cloud it is possible to reduce the hardware cost, and more importantly it avoids the huge initial investment. The fact that the cloud is more widely adopted also reduces the labor cost and ramp-up time required to build such systems.
Having said that, the fact that anyone, even the smallest startup can have a large data center at a distance, reduces not just the initial investment and the direct hardware cost, but opens up a new set of opportunities for building such applications. For example it is possible to outsource the entire analytic process to someone else and use it as a service.
This was the case with Primatics Financial that provides risk analysis for mortgage portfolios as a service.
The Cloud and Map/Reduce Make Analytics Simple
With grid-based systems that run in the data center, the pool of machines is fixed. One of the main roles of the grid scheduler is to manage the pool of physical resources which belong to that pool. The cloud on the other hand already takes care of most of the resource scheduling part. Virtualization makes it possible to get the isolation required to maintain the separation between multiple jobs. If these two factors can be leveraged, a simpler model can be developed.
One of the ways to simplify the analytic application is to simplify the way that parallelization is handled. Since most of the resource scheduling is dealt with through the Cloud/Virtualization layer, a simple pattern such as Master/Worker, Map/Reduce can be used to parallelize, i.e. to distribute the application across the available machines.
You can read more about the Map/Reduce, Master/Worker approach here, and how it compares with a grid-based system.
Reducing the Number of Moving Parts
Another way to simplify and optimize the application is to reduce the number of moving parts.
Rather than having a different set of services to load the data, hold it in memory, and yet another set of services that computes the analytics, it is possible to bundle the loader, the data and the processing services into one application unit, and have multiple instances of those units running to handle the scale. In this case, each unit handles the entire life cycle (the loading, processing etc.) of the data analytics food chain.
SaaS Enablement through Multitenancy Support
One of the core elements that is required to provide the analytics application as a service is multitenancy. Multitenancy is the ability to run the multiple analytics processes on a shared infrastructure as if they were running on dedicated resources. To add multitenancy to our service, the life cycle (Start, Stop etc.) of each of the analytics needs to be managed independently of the other analytical processes that are running over the network.
There are two possible layers of isolation that can be chosen:
Service-level isolation provides isolation at the application component level. You can deploy multiple services on the same process/machine; you can run different versions of that same service in the shared environment and maintain them in complete isolation to other services that are running in the same environment.
Since all services share the same machine, they can potentially access shared resources such as the file system, etc. It is also harder to control things like CPU utilization and memory allocation on a per service level.
Machine-level isolation is provided through the use of a virtual machine. In this case, each application or group of applications runs on separate virtual machines. Virtual machines provide an additional level of isolation that includes the actual CPU, Disk and Memory resources.
Combining Service-Level and Machine-Level Isolation
Both service-level and machine-level isolation are actually complementary approaches, as each one of them provide a different level of isolation and application packaging granularity. Service-level virtualization tends to be more fine grained, but at the same time provides a lower level of isolation. Therefore, it makes sense to use it in areas where real CPU, Memory and file system isolation are not that critical, but getting better utilization from the existing machines is more important, and vice-versa.
In the diagram above, you can choose to run everything from the same organization at a service level of isolation, and use a VM level of isolation between different customers to ensure that each customer’s application is running in a completely private set of VMs.
With the introduction of private cloud providers such as VMware with their Vcloud and Vsphere approach, and open source alternatives such as Eucalyptus, you can choose to run the same solution in a hosted environment or locally. You can also use a hybrid approach where some of the processing runs in your local environment and the rest in the hosted environment. Things like the Amazon Virtual Private cloud make such a hybrid deployment model fairly simple, as the local and hosted machine are accessed in the exact same way.
The continued demand for processing a growing amount of data in real-time and at a lower cost, drives continuous innovation in the way data is processed. Grid-based analytics (Compute and Data) addressed the demand for scaling, and also reduced processing time quite significantly, but at a relatively high cost and complexity. The demand for real time analytics by mainstream applications in the social networking media created a drive for a simpler and less costly solution. Cloud computing enables the cost of hardware to be reduced, but at the same time opens up new opportunities for making the analytic processing significantly simpler. You can rely on the cloud for scheduling and managing your resources, and on virtualization for multitenancy. With that, a simplified parallel processing approach, such as Map/Reduce, can be used to handle the parallelization of the application.
“it's quite clear that the business-intelligence/data-warehouse industry is moving toward a new paradigm wherein the optimal data-persistence model will be provisioned automatically to each node based on its deployment role -- and in which data will be written to whatever blend of virtualized memory and disk best suits applications' real-time requirements." (Forrester)