Speed and simplicity

So what’s so great about Spark, anyway? The main advantage it offers developers is speed. Spark applications are an order of magnitude faster than those based on MapReduce – as much as 100-fold, according to co-creator Matei Zaharia, now CTO at Databricks, a company that offers Spark in the cloud, running not on Hadoop, but on the Cassandra database.

It is important to note that Spark can run on a variety of file systems and databases, among them the Hadoop Distributed File System (HDFS).

What gives Spark the edge over MapReduce is that it handles most of its operations ‘in memory’, copying working data sets from distributed physical storage into far faster RAM. MapReduce, by contrast, writes intermediate results back to disk and reads them in again at each stage. Where reading 1MB of data from disk is measured in milliseconds, reading it from memory takes a fraction of a millisecond. In other words, Spark can give organisations a major time-to-insight advantage.
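
The difference shows up most clearly in jobs that make several passes over the same data. Below is a minimal Scala sketch for the Spark shell, with a hypothetical log file on HDFS: the data is read from disk once, cached, and every subsequent query runs against the in-memory copy.

```scala
// Spark shell (Scala) sketch; "sc" is the SparkContext the shell provides and
// the HDFS path is hypothetical. The file is read from disk once, cached in
// RAM, and later passes over it are served from memory rather than storage.
val logs = sc.textFile("hdfs:///data/app-logs").cache()

val total  = logs.count()                              // first action: reads HDFS and fills the cache
val errors = logs.filter(_.contains("ERROR")).count()  // runs against the in-memory copy
```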

Gartner analyst Nick Heudecker says: “One client I recently spoke to, with a very large Hadoop cluster, did a Spark pilot in which it was able to take a job from four hours [using MapReduce] to 90 seconds [using Spark].”

For many organisations, that kind of improvement is highly attractive, says Heudecker. “It means they can move from running two analyses a day on a given dataset to as many analyses as they like.”

At the Spark Summit in June, Brian Kursar, director of data science at Toyota Motor Sales USA, described the improvement his team had seen in running its customer experience analysis application. This is used to process about 700 million records taken from social media, survey data and call centre operations, in order to spot customer churn issues and identify areas of concern, so that employees can intervene where necessary.

Using MapReduce, the analysis took 160 hours to run. That’s almost seven days, Kursar pointed out to delegates. “By that point, [that insight] is a little too late,” he said. The same processing job, rewritten for Spark, was completed in just four hours.

Other big advantages that Spark can offer over MapReduce are its relative ease of use and its flexibility. That is hardly surprising, since Matei Zaharia created Spark for his PhD at the University of California, Berkeley, in response to the limitations he had seen in MapReduce while working in summer internships at early Hadoop users, including Facebook.

“What I saw at these organisations was that users wanted to do a lot more with big data than MapReduce could support,” he says. “It had a lot of limitations – it couldn’t do interactive queries and it couldn’t handle advanced algorithms, such as machine learning. These things were a frustration, so my goal was to address them and, at the same time, I wanted to make it easier for users to adopt big data and start getting value from it.”

Most users agree that Spark is more developer-friendly, including Toyota’s Kursar, who said: “The API was significantly easier to use than MapReduce.”

A recent blog by Cloudera’s head of developer relations, Justin Kestelyn, claims that Spark’s “rich, expressive, identical” APIs for Scala, Java and Python can reduce code volume by a factor of two to five compared with MapReduce.
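
The classic word count gives a feel for that claim. The hedged Scala sketch below uses illustrative input and output paths; the whole job is a handful of chained operations, where the equivalent MapReduce program typically needs separate mapper, reducer and driver classes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Input and output paths are illustrative.
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

    sc.textFile("hdfs:///data/input")      // read the input as lines of text
      .flatMap(_.split("\\s+"))            // break each line into words
      .map(word => (word, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)                  // add up the counts for each word
      .saveAsTextFile("hdfs:///data/wordcounts")

    sc.stop()
  }
}
```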

But this ease of use does not mean flexibility is sacrificed, as Forrester analyst Mike Gualtieri pointed out in a report published earlier this year. On the contrary, he wrote, Spark includes specialised tools that can be used separately or together to build applications.

These include Spark SQL, for analytical queries on structured, relational data; Spark Streaming, for near real-time processing of data streams using frequent ‘micro-batches’; MLlib, for machine learning; and GraphX, for representing as a graph data that is connected in arbitrary ways, for example networks of social media users.
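
As a flavour of how these pieces slot in alongside the core engine, here is a hedged Spark SQL sketch in Scala, with an invented ‘orders’ data set: a plain SQL query runs on the same engine as the batch examples above, and the result is an ordinary DataFrame that libraries such as MLlib can consume in the same application.

```scala
// Spark shell sketch of Spark SQL (the "orders" schema and path are made up).
// In recent Spark versions the shell provides a SparkSession named "spark".
val orders = spark.read.parquet("hdfs:///data/orders")
orders.createOrReplaceTempView("orders")

// Plain SQL over the registered view; the result is a DataFrame that can be
// handed on to other Spark libraries or written back to storage.
spark.sql("""
  SELECT customer_id, SUM(amount) AS total_spend
  FROM orders
  GROUP BY customer_id
  ORDER BY total_spend DESC
  LIMIT 10
""").show()
```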