Introduction to distributed computing with Spark

This is a companion post I wrote for the talk “Introduction to distributed computing with Spark and Dask” I delivered together with Adrian Pino Alcalde at Kernel Analytics in June 2016. Adrian took care of the Dask part, while I concentrated on Spark.

In this post we will provide an introduction to Apache Spark. We will cover what it is, why you would want to use it, how it differs from Hadoop, and the basic concepts of the Spark architecture, and conclude with some example applications.

What is Spark?

Apache Spark is a general-purpose framework for cluster computing, started around 2009 at UC Berkeley. It was later open sourced and donated to the Apache Software Foundation, becoming a top-level project in 2014. The latest version, v1.6.1, was published on March 9, 2016, so it’s definitely a current and maintained project. It is written in the Scala programming language (which is super cool itself!) and has interfaces in Java, Scala, Python 2 and R. A recent (2014) O’Reilly survey shows that people who use Spark (and Apache Storm) have the highest median salaries among all Data Science practitioners… although correlation does not imply causation! 😉

Use of Spark is almost always paired with Big Data. Although you can execute arbitrary functions, most Spark primitives and functions are designed to deal with data: Filter it, transform it, massage it, rearrange and join it, create a model from it.

But let’s pause for a minute. What does cluster computing mean? A long time ago very smart people realized that some computations can be split into “partitions” and then performed in parallel, aggregating the results at the end. For example, if we want to compute the sum of a long list of numbers, we can train to get awesomely fast at adding up numbers, or we can partition the list into multiple chunks, give each chunk to a not-so-fast friend, ask each of them to compute the sum of their chunk and tell you the result, and then add up those partial sums to get the global sum of the whole list.

$$\mathrm{GSum} = \sum_i x_i = \sum_f \sum_{j \in f} x_j$$

Here you are taking advantage of the fact that each of your friends can compute their sum on their own, effectively parallelizing the whole computation and speeding it up a lot. Of course there is an overhead in telling your friends what to do and passing the messages back and forth, so there is a limit to the speedup you can get from parallelizing this way.
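The friends analogy can be sketched in a few lines of plain Python. This is only an illustration and not Spark code: the function name `partitioned_sum`, the chunking scheme and the thread pool standing in for the "friends" are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def partitioned_sum(numbers, n_friends):
    """Sum `numbers` by splitting them into chunks, one per 'friend'."""
    # Ceiling division, so that every element lands in some chunk
    chunk_size = max(1, -(-len(numbers) // n_friends))
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    # Each friend (a thread here, standing in for a machine) sums its own chunk
    with ThreadPoolExecutor(max_workers=n_friends) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # You only have to add up one number per friend
    return sum(partial_sums)


numbers = list(range(1, 101))
print(partitioned_sum(numbers, 4))  # same result as sum(numbers)
```

Threads in CPython will not actually make this faster; the point is only the shape of the computation: independent partial results, followed by a cheap final aggregation.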

Now, translating to the computer world, this means that you can use multiple commodity machines to run calculations in parallel instead of one super-big expensive machine. This is the foundation of cluster computing and what Spark does: Orchestrate a group of computers to perform some computations.

The orchestration in Spark happens within a driver-workers architecture. A Spark cluster consists of:

  • A driver node. It is responsible for creating a SparkContext object, which is the entry point to the whole application. This is where the definition, scheduling and monitoring of the tasks happen.
  • A set of distributed worker nodes that execute the tasks.

It is important to emphasize that what is “moving” here is the code to be run (in the form of closures represented by e.g. the lambda functions), from the driver to the workers.

[Figure: Spark cluster architecture. The driver node runs the driver with its SparkContext; each worker node runs executors, and each executor holds a cache and several tasks.]

Image credit: Alexey Grishchenko

If the code sent to the workers has errors and raises an exception, the exception is wrapped and sent back to the driver node to help with debugging. As an additional godsend, the driver node provides an awesome Web UI to inspect the status of the cluster and its tasks.

Why use it and how is it different from Hadoop?

Apart from getting a higher paycheck at the end of the month, you are surely wondering why on earth you should learn yet another Pokémon Big Data technology. What does Spark add to the Hadoop stack?

There are multiple reasons that motivated the development of Spark, but perhaps the most important one was an attempt to improve performance when using iterative algorithms, which are very common in Machine Learning. A typical Hadoop MapReduce job writes to disk 3 or more times; if we wanted to run such a job iteratively in Hadoop we would have to hit the disk many times, reducing performance a lot.

Spark brings three features to avoid the iteration-wise and other performance bottlenecks:

  1. First, it provides lazy computations (using RDDs and DAGs, covered later). This means that the defined computations are not performed until they are deemed necessary, i.e. when a result has to be printed to the screen or saved to a file. This way we can interactively chain operations and Spark will automatically decide what needs to be computed (optimizing as much as possible) to produce the final output before starting any computation.
  2. Second, Spark has in-memory (LRU) data caching, allowing efficient reuse of data read from disk across multiple iterations. It is important to emphasize here that having a RAM cache does not mean that Spark performs all its computations in memory. This is a common misconception: Spark writes data to disk many times, for example during data shuffles in the reduce or join steps. So using an SSD will certainly improve Spark performance a lot.
  3. Last, it allows efficient pipelining, avoiding spill to hard disk as much as possible. We will see how this works later.
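Lazy evaluation and pipelining can be illustrated with plain Python generators. This is only an analogy (Spark plans a whole graph of operations before running anything, which generators do not do), and the names `add_three` and `calls` are made up for the example.

```python
calls = []  # records every element that actually gets processed


def add_three(x):
    calls.append(x)
    return x + 3


numbers = range(1, 10**6 + 1)

# Defining the pipeline does no work at all: generators are lazy
shifted = (add_three(x) for x in numbers)
big = (x for x in shifted if x > 5)
assert calls == []

# Asking for a result triggers the computation, and only as much as is needed
first_two = [next(big), next(big)]
print(first_two)   # [6, 7]
print(len(calls))  # 4
```

Each element flows through both steps before the next one is read, so no intermediate list of a million shifted numbers is ever built.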

Another motivation behind Spark is to make pipelining code much easier to write. For example, the Java code of the basic WordCount example for Hadoop is 61 lines long (see the end of the post) and is unfathomable. The same example can be written in Python or Scala for Spark in 5 lines, and is easy to understand in the context of the Map-Reduce paradigm.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)

Does all this mean that we don’t need Hadoop at all when using Spark? Yes and no. In essence, Hadoop is the sum of the Hadoop Distributed File System (HDFS) and a MapReduce implementation. Spark does not itself include a way to read and write data to a filesystem in a distributed (cluster) fashion, so you usually run it inside a Hadoop cluster (using YARN or Mesos) to use HDFS. There are reports of some people running Spark with other distributed file systems like AWS S3, but there are still a lot of rough edges to take care of.

You can also run Spark in standalone mode without any distributed file system, but think about the implications of this in the context of Big Data for a second.

A key performance booster in cluster computing with Big Data is the concept of data locality. If we had all the data in a single place and wanted to distribute the computation over several nodes, we would have to send this data over the network to all of them. If we are talking about TBs, this would be a huge performance bottleneck. A much better strategy (and what HDFS allows) is to partition the data and store it in a distributed fashion among all the nodes. We can then instruct every node to perform the desired computation over the partition of data it already has and only send the final aggregated results over the wire, which are hopefully much smaller (think of a per-partition arithmetic mean vs. the whole list).
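As a rough sketch of that last idea in plain Python (not Spark code; the function names and the three hard-coded partitions are hypothetical), each node reduces its own partition to a tiny pair of numbers and only those pairs travel to the driver:

```python
def local_aggregate(partition):
    """Runs 'on the node': reduce a partition to a tiny (sum, count) pair."""
    return (sum(partition), len(partition))


def combine(partials):
    """Runs 'on the driver': merge the small per-node results into a mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical data that already lives on three different nodes
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

partials = [local_aggregate(p) for p in partitions]
print(partials)           # [(6.0, 3), (9.0, 2), (30.0, 4)]
print(combine(partials))  # 5.0
```

Each node sends a (sum, count) pair rather than its own mean because the partitions have different sizes, so averaging the per-partition means would give the wrong answer.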

When should you consider using Spark? As briefly noted above, there is a considerable overhead involved in scheduling tasks and passing messages along, so setting up Spark usually only pays off if:

  • Your client already has a distributed Hadoop cluster for processing and storing the data, where you can plug Spark in.
  • You will have to deal with hundreds of GBs of data and have the $$$ to set up a cluster of multiple machines.
  • You want to use the distributed Machine Learning algorithms in Spark MLlib.

If you have a large workstation and a small database of, say, dozens of GBs, you are probably better off with a classic SQL/noSQL database (for row-like data) or a lightweight out-of-core system like Dask (for numerical data).

Spark data abstractions

Resilient Distributed Datasets

The fundamental unit of data in Spark is the RDD. You can think of an RDD as a representation of a dataset distributed among the nodes of the cluster. It’s an immutable piece of metadata that contains information about where the data comes from and how it has to be transformed. This metadata is stored in the driver node mentioned above.

New RDDs are created by loading data (e.g. from HDFS) or by applying transformations to already existing RDDs. For example, we can load a list of numbers from disk into RDD1 and apply the “add three” transformation to all of them, resulting in a second RDD2.

RDD1 --- “add three” ---> RDD2

A key feature of RDDs is that the lineage of transformations is recorded and applied in a lazy fashion, i.e. their execution is deferred until it’s needed to produce a readable output. In the example above, RDD2 keeps track of the fact that its data is the result of applying the “add three” transformation to the data of RDD1. This way we can fearlessly chain (pipeline) transformations in an interactive way without performance penalties. Additionally, if a certain machine crashes while performing some transformation, the work can be retried easily because the RDD metadata says what should be done.
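To make the idea of lineage concrete, here is a toy model in plain Python. It is not how Spark is implemented: the `ToyRDD` class and its methods are invented for this sketch, and everything runs on a single machine.

```python
class ToyRDD:
    """Immutable description of a dataset: a source plus recorded transformations."""

    def __init__(self, source, lineage=()):
        self._source = source           # callable that (re)loads the raw data
        self._lineage = tuple(lineage)  # transformations, in application order

    def map(self, func):
        # No data is touched: we only return a new, longer description
        return ToyRDD(self._source, self._lineage + (func,))

    def collect(self):
        # Action: load the data and replay the whole lineage on it
        data = list(self._source())
        for func in self._lineage:
            data = [func(x) for x in data]
        return data


rdd1 = ToyRDD(lambda: [1, 2, 3])
rdd2 = rdd1.map(lambda x: x + 3)  # "add three", nothing is computed yet

print(rdd2.collect())  # [4, 5, 6]
print(rdd1.collect())  # [1, 2, 3]
```

Because `rdd2` only stores the source and the list of functions, calling `collect()` again (for instance after a failure) simply replays the same steps from scratch.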

To efficiently schedule the transformations across the cluster, Spark uses a more advanced abstraction of the execution sequence called a Directed Acyclic Graph (DAG). This is basically a graph where the nodes are RDD partitions and the edges are the transformations applied to them. By inspecting the graph, Spark groups into stages all the operations that can be performed on a single partition, which improves performance. (As a side note, DAGs are a very powerful abstraction used for scheduling in Dask as well.)

The image below shows the DAG in the WordCount example:

Image credit: Alexey Grishchenko

There are many transformations in Spark (see Spark docs), but probably three of the most useful ones are:

  • map: Apply an arbitrary function to every element of the RDD. Example:
    rdd1.map(lambda x: x + 3)
  • filter: Filter out elements not satisfying the specified condition. Example:
    rdd1.filter(lambda x: x > 5)
  • reduceByKey: Aggregate values by key. Example (word count, for an RDD of (word, count) pairs):
    rdd1.reduceByKey(lambda c1, c2: c1 + c2)
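Of the three, reduceByKey is probably the least intuitive one. Its semantics can be sketched on a single machine in plain Python; the helper `reduce_by_key` below is hypothetical and ignores everything that makes the real operation distributed (partitioning, shuffling):

```python
def reduce_by_key(pairs, func):
    """Merge all the values that share a key using `func`."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


line = "to be or not to be"
pairs = [(word, 1) for word in line.split(" ")]  # the "map" step
counts = reduce_by_key(pairs, lambda c1, c2: c1 + c2)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```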

Ok, remember that some lines above we said that transformations are deferred until they are “needed to produce a readable output”? Well, what triggers this “necessity” is the application of a different set of Spark functions called actions, which turn an RDD into something different. Again, some of the most common ones are:

  • reduce: Aggregate the elements of the RDD using a commutative and associative function that takes two arguments and returns one. Example:
    rdd1.reduce(lambda v1, v2: v1 + v2)
  • takeOrdered: Get the first N elements of an RDD, sorted in ascending order or as specified by the optional key function. Example:
    rdd1.takeOrdered(6, key=lambda x: -x)

The results of an action are sent to the computer driving the computation (the driver), so one has to be careful not to produce results so large that they are costly to send over the wire. In the example above, using rdd1.takeOrdered(10e10) can potentially blow up your driver computer.
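A single-machine sketch of what takeOrdered returns, using only the Python standard library, shows why the size of the result is bounded by N. The wrapper `take_ordered` is a made-up name for this example and not part of any Spark API.

```python
import heapq


def take_ordered(data, n, key=None):
    """Return the n smallest elements (according to `key`) without sorting everything."""
    return heapq.nsmallest(n, data, key=key)


data = [7, 1, 9, 4, 12, 3, 8]
print(take_ordered(data, 3))                    # [1, 3, 4]
print(take_ordered(data, 3, key=lambda x: -x))  # [12, 9, 8]
```

However large `data` is, only `n` elements come back, and that is the part that would have to fit in the memory of the driver.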

Spark DataFrames

The Spark DataFrame API was introduced with Spark 1.3 (Feb 2015). It closely mimics the API of a Pandas dataframe, and in fact Spark DataFrames can be converted to and from Pandas dataframes (see the df.toPandas() method and sqlContext.createDataFrame(pandas_df)). Spark DataFrames can also be created from Hive tables, Parquet files, MySQL, PostgreSQL, AWS S3 and more.

DataFrames are a wrapper around RDDs, where data is stored in a columnar format (so working with a small subset of columns is more efficient), along with field names, data types and some basic statistics for each column. The DataFrames API was purposely developed to provide a much more human interface for Big Data analysts to use Spark. Instead of wrapping your head around the map-reduce paradigm, you can now use high-level data functions. For example, if users is a DataFrame:

# Create a new DataFrame that contains “young users” only
young = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
young = users[users.age < 21]
# Increment everybody’s age by 1 (the name column is assumed here)
young.select(young.name, young.age + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame called logs
young.join(logs, logs.userId == users.userId, "left_outer")

Much easier, uh?

One of the easiest ways to create a DataFrame is to wrap an RDD of Rows. For example, if we have a file at path ratingsFilename whose rows follow the schema userid,movieid,rating we can do:

from pyspark.sql import SQLContext, Row
sqlc = SQLContext(sc)

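def parseRatingsDF(line):
    # Each line looks like "userid,movieid,rating"
    parts = line.split(',')
    return Row(userid=int(parts[0]), movieid=int(parts[1]), rating=float(parts[2]))

rawRatings = sc.textFile(ratingsFilename)
# Parse every line into a Row and wrap the resulting RDD in a DataFrame
dfRatings = sqlc.createDataFrame(rawRatings.map(parseRatingsDF))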

DataFrames also bring a considerable performance improvement to Spark data processing. Why? Because they provide a unified API for data manipulation that avoids the overhead introduced by data serialization/deserialization in PySpark (managed through py4j). Consider that Spark has to serialize the data from a Java object and deserialize it back in Python to be able to apply a PySpark-defined lambda function to an RDD. With DataFrames this is no longer needed, because when we do young.groupBy("gender").count() we are not executing Python functions anymore. Think of it as the speed-up provided by using NumPy high-level methods vs. plain Python functions over a list of numbers.

Lights, camera… action!

The best way to learn how Spark works is to see it in action. As in many other tutorials and introductions, we are going to use the standalone version of Spark, which doesn’t need a Hadoop cluster to run. It isn’t a real Big Data use case, but the purpose here is to introduce the basic Spark functionality.

The examples are provided in the form of a Jupyter notebook. To run it, follow these instructions:

  1. Download this Jupyter notebook.
  2. Install Python and IPython/Jupyter in your system. In Ubuntu you can install the ipython-notebook package.
  3. Download the “Pre-built for Hadoop 2.6 and later” Spark package and extract it somewhere in your Linux system.
  4. Start a PySpark IPython (Jupyter) notebook server with
    IPYTHON_OPTS="notebook" path-to-spark/bin/pyspark --master local[n]
    where n is the number of simultaneous workers you want to run in your system.
  5. In the browser window that will pop up, open the Jupyter notebook you downloaded in step 1 and follow along.
