Apache Spark, the open source distributed data processing system, may be about to light a wildfire under business applications.

At least, that's the "huge bet" that IBM is making on the project, a high-octane data processing engine that often runs alongside Hadoop, says Joel Horwitz, director of portfolio marketing for the IBM Analytics Platform.

"We've identified Spark as being one of the fastest growing open source projects ever," Horwitz tells SDxCentral. "It's just taken off."

If you've never heard of Spark, you aren't alone. The project only picked up real steam in February of last year, when the Apache Software Foundation promoted it to a top-level project. But little more than a year later, Spark users include web titans eBay, Yahoo, Alibaba, and Amazon, to name a few.

So what on earth does Spark do? In short, it's an extremely fast cluster computing framework for working with distributed, unstructured data. Designed for data scientists and application developers, it allows users to run powerful analytics on unrelated data sets — say, transaction data and Twitter habits.
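To make that concrete, here is a minimal Scala sketch of that kind of cross-data-set analysis using Spark's core RDD API. The file paths, CSV layout, and shared user ID key are hypothetical stand-ins for illustration, not anything IBM or the Spark project prescribes.

```scala
// Minimal sketch: join two unrelated data sets -- purchase records and tweets --
// on a shared user ID. File names and record formats are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object CrossDataSetJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cross-data-set-join"))

    // Assume CSV lines of the form "userId,amount" and "userId,tweetText".
    val purchases = sc.textFile("hdfs:///data/transactions.csv")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))
    val tweets = sc.textFile("hdfs:///data/tweets.csv")
      .map(_.split(",", 2))
      .map(fields => (fields(0), fields(1)))

    // Join the two data sets on user ID and count users who appear in both.
    val joined = purchases.join(tweets)
    println(s"Users with both a purchase and a tweet: ${joined.keys.distinct().count()}")

    sc.stop()
  }
}
```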

Hadoop, the distributed storage and processing framework, has its own batch processing engine, MapReduce. But Spark uses a storage abstraction that lets applications keep data in memory across queries, allowing machine learning algorithms and certain other applications to run more than 100 times faster than they do on MapReduce, according to a UC Berkeley study.
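That in-memory abstraction is easy to see in code. In the sketch below (the log file path and its contents are hypothetical), calling cache() after the first pass keeps the filtered records in cluster memory, so follow-up queries over the same data skip the disk read that a MapReduce job would repeat.

```scala
// Minimal sketch of caching a working set in memory across queries.
import org.apache.spark.{SparkConf, SparkContext}

object CacheAcrossQueries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-across-queries"))

    // Hypothetical log file; the path and parsing are illustrative only.
    val errors = sc.textFile("hdfs:///data/app.log")
      .filter(_.contains("ERROR"))
      .cache() // keep the filtered records in cluster memory

    // The first action reads from disk and populates the cache...
    println(s"Total errors: ${errors.count()}")
    // ...subsequent queries over the same data are served from memory.
    println(s"Timeout errors: ${errors.filter(_.contains("timeout")).count()}")

    sc.stop()
  }
}
```

Iterative workloads such as machine learning benefit most, since they make many passes over the same data set.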

To put it in perspective, that's roughly the difference between the fastest human on record (Usain Bolt) and the fastest jet aircraft (the Lockheed SR-71 Blackbird).

"Hadoop is a great data management system, but it wasn't very powerful as a data processing system," says Horwitz. "Imagine you have a huge hard drive with lots of videos on it, but you don't have a video player." Spark is like that video player for massive data sets.

That makes it a compelling project for IBM, which is increasingly positioning its cloud services as a way to plug into data and analytics APIs. On Monday, the company announced it would contribute its SystemML machine learning technology to the Spark open source ecosystem and use Spark at the core of its analytics and commerce platforms.

IBM will also offer Spark-as-a-service through its Bluemix developer platform.

The strategy hinges on Spark's approach to data processing going mainstream just as the analytics boom goes supernova. By contributing SystemML, IBM hopes to establish its approach to writing machine learning algorithms as an industry standard.

If that happens, says Horwitz, IBM will be in pole position to cash in on data crunching and analytics efforts like its Watson cognitive computing platform.

"Spark is like the mint of the insight economy, and the currency is algorithms," he adds.

"Instead of the space race, it'll be the algorithm race."