Part One: An Overview of the Current Situation
Only a few years ago, nearly everyone relied exclusively on SQL to tackle Big Data needs, but as the demand for speed and capacity has grown, so have our options. There are now a number of newer data systems, most of them based on NoSQL, each developed to serve a specific area.
In this post, we'll take a look at seven APIs in particular and explore how these systems can be optimized for maximum speed and memory capabilities.
The Seven Most Popular APIs in Big Data
1. SQL: The term “NoSQL” may suggest that SQL is no longer relevant in this new data world, but many NoSQL implementations support a major subset of SQL or an SQL-like query language. SQL provides a rich set of query and data management capabilities and is often the lowest common denominator among data management systems.
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London'
2. Document: The document API lets users write records with different field structures to the same logical table without any need for schema evolution, which is why it is often described as “schemaless”. It is one of the most popular APIs for web applications that use the JSON data model.
db.inventory.insert( { _id: 10, type: "misc", item: "card", qty: 15 } )
3. Object Graph: This navigation API is best suited to hierarchical or highly connected data structures (e.g., social graphs).
// Create two nodes and attach a property to each
firstNode = graphDb.createNode();
firstNode.setProperty( "message", "Hello, " );
secondNode = graphDb.createNode();
secondNode.setProperty( "message", "World! " );
// Connect the nodes with a typed relationship that carries its own property
relationship = firstNode.createRelationshipTo( secondNode, RelTypes.KNOWS );
relationship.setProperty( "message", "brave Neo4j" );
4. Tuple API: The tuple API is one of the most common APIs for messaging and stream processing use cases. It represents a simple data structure that maps onto a flat data object. Queries often reuse the same tuple structure that was used to write the data, meaning that the query tuple acts as a “mask” that indicates which instance type and which matching fields are to be selected.
JavaSpace space = getSpace();
AttrEntry e = new AttrEntry();
e.name = "Duke";
e.value = new GIFImage("dukeWave.gif");
// Arguments: the entry, the transaction (null = none), the requested lease in milliseconds
space.write(e, null, 60 * 60 * 1000);
// lease is ignored -- one hour will be enough
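To make the “mask” idea more concrete, here is a minimal in-memory sketch of template matching. It is not the JavaSpaces implementation: the class TupleMaskSketch, the Attr tuple, and the rule that a null field in the template matches any value are assumptions made for illustration, and the sketch matches on fields only, not on instance type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a tuple space; not the JavaSpaces API.
class TupleMaskSketch {
    // A flat tuple with two fields. In a template, a null field means "match anything".
    static final class Attr {
        final String name;
        final String value;

        Attr(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final List<Attr> entries = new ArrayList<>();

    // Stores a tuple in the space.
    void write(Attr entry) {
        entries.add(entry);
    }

    // True when every non-null template field equals the corresponding entry field.
    static boolean matches(Attr template, Attr entry) {
        return (template.name == null || template.name.equals(entry.name))
            && (template.value == null || template.value.equals(entry.value));
    }

    // Returns the first stored tuple that matches the template, or null if none does.
    Attr read(Attr template) {
        for (Attr e : entries) {
            if (matches(template, e)) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TupleMaskSketch space = new TupleMaskSketch();
        space.write(new Attr("Duke", "dukeWave.gif"));
        space.write(new Attr("Tux", "tux.png"));

        // The template fixes the name and leaves the value as a wildcard.
        Attr hit = space.read(new Attr("Duke", null));
        System.out.println(hit.name + " -> " + hit.value);
    }
}
```

The query here is just another tuple of the same shape as the data, which is what lets the same structure serve as both the write format and the query language.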
5. Key/Value: Key/Value is the simplest form of data structure. As the name suggests, each data object is indexed by a single key. This API is the most popular for caching and often serves as the underlying data structure of more advanced data management solutions.
MemcachedClient c = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
// Store the object under "someKey" with an expiry of 3600 seconds
c.set("someKey", 3600, someObject);
Object myObject = c.get("someKey");
c.delete("someKey");
6. Stream Based: This event processing model is best suited to scenarios that require continuous updates. It is a common API for real-time analytics, which explains its growing popularity in Big Data systems that rely heavily on incremental updates but don't require locking a large set of data.
TopologyBuilder builder = new TopologyBuilder();
// The last argument of each call is a parallelism hint
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
// Group by the "word" field so the same word always reaches the same counter
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
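The incremental-update idea behind stream processing can be sketched without any framework. The following is only an illustration under stated assumptions, not how any particular streaming engine implements its counters: the class StreamCountSketch and its onEvent method are hypothetical names, and the state is a plain in-memory map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-process illustration of per-event incremental updates.
class StreamCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Called once per incoming event; only the entry for this word is touched,
    // so the rest of the data set is never scanned or locked.
    int onEvent(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    public static void main(String[] args) {
        StreamCountSketch counter = new StreamCountSketch();
        for (String w : "the cow jumped over the moon".split(" ")) {
            // Print the running count after each event.
            System.out.println(w + "=" + counter.onEvent(w));
        }
    }
}
```

Each event updates the result as it arrives, so an up-to-date count is available at any moment rather than only after a batch job finishes.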
7. Map/Reduce: This API is used to perform aggregation on distributed data. The Map/Reduce model breaks an aggregation into two or more phases. Map executes the aggregation on each data node, and Reduce takes the sub-aggregations from all nodes and reduces them into one consolidated result. Operations such as calculating a maximum or an average (mean) are typical examples of the Map/Reduce model.
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;
-- temporary table to hold words...
CREATE TABLE words (word STRING);
-- make the splitter script available to the job
ADD FILE splitter.py;
INSERT OVERWRITE TABLE words
SELECT TRANSFORM(line)
USING 'python splitter.py'
AS word
FROM input;
SELECT word, COUNT(*) AS count FROM words GROUP BY word;
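The two phases described above can also be sketched in plain Java. This is a single-process illustration, not a distributed implementation: the class MapReduceSketch, the method names mapPartition and reduce, and the sample word lists standing in for data nodes are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process illustration of the Map and Reduce phases.
class MapReduceSketch {
    // Map phase: count the words held by one node's partition.
    static Map<String, Integer> mapPartition(List<String> words) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : words) {
            partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce phase: merge the per-node partial counts into one consolidated result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Each inner list stands in for the data stored on one node.
        List<List<String>> nodes = Arrays.asList(
            Arrays.asList("big", "data", "big"),
            Arrays.asList("data", "api"));

        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> node : nodes) {
            partials.add(mapPartition(node));
        }

        System.out.println(reduce(partials).get("big")); // prints 2
    }
}
```

Counts and maximums can be merged directly in the reduce step. An average cannot, so each node would need to return a sum and a count rather than its local average.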
Which System Is Best for Me?
There is no “one-size-fits-all” approach when it comes to data management systems. Because most of today's data management systems expose an API that is tied to the data model in which the data is stored, we generally can't write data with one API and read it with another.
This means that if you want to use the same data for a different purpose, you need to maintain a copy of that data for each use case, API, and data store. As such, a typical application would need to include a combination of various data management solutions with complex data flows between them, as illustrated below:
But does it really have to be that complex?
Stay tuned for “The Seven Most Popular APIs in Big Data—Part Two” to find out!