Download Spark
- First, download Spark. I downloaded the binaries given as spark-2.1.0-bin-hadoop2.7.
- You'd also need to install the JDK; I took it from here.
Set and test
Tree structure
Installing Spark is simple.
All you have to do is extract the archive.
I placed the extracted binaries in ~/Applications, resulting in the following tree structure:
~/Applications
└── spark-2.1.0-bin-hadoop2.7
At this point, you can already run Spark.
Look for ~/Applications/spark-2.1.0-bin-hadoop2.7/bin/pyspark.
To simplify the settings, I created a symbolic link at ~/Applications/spark by running ln -s spark-2.1.0-bin-hadoop2.7 spark from within ~/Applications, yielding the following structure:
~/Applications
├── spark -> spark-2.1.0-bin-hadoop2.7
└── spark-2.1.0-bin-hadoop2.7
Environment variables
The next step is to set environment variables.
In particular, I added the following lines to ~/.bash_profile:
export SPARK_HOME="/Users/user/Applications/spark"
export PYSPARK_SUBMIT_ARGS="--master local[2]"
# Make pyspark available anywhere
export PATH="$SPARK_HOME/bin:$PATH"
The --master local[2] argument tells Spark to run locally using two worker threads. Remark: spark-shell (i.e. the Scala-based REPL) should also be accessible at this point.
Test
Now, it is time for testing.
Start a new terminal session, or source ~/.bash_profile.
Now, simply run pyspark.
You should get a Python REPL console with the SparkContext already loaded as sc.
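To double-check that the settings from ~/.bash_profile took effect, you can inspect the context. A minimal check (sc being the SparkContext the shell created):
print(sc.version)  # should print 2.1.0
print(sc.master)   # should print local[2], as set in PYSPARK_SUBMIT_ARGS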
Try out:
x = sc.parallelize([1, 2, 3])
to create a simple RDD.
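If the RDD was created successfully, a couple of basic actions should work as well. A minimal sketch using the x defined above:
x.collect()                   # returns [1, 2, 3]
x.map(lambda v: v * v).sum()  # returns 14 (= 1 + 4 + 9)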
Use IPython and Jupyter
The standard Python REPL is somewhat crappy.
You probably want to use IPython or, even better, Jupyter.
To that end, I added the following to my ~/.bash_profile:
# Point PYSPARK_PYTHON at the Python executable (not the Anaconda root directory)
export PYSPARK_PYTHON=/Users/drorata/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/Users/drorata/anaconda3/bin/ipython
alias pysparknb='PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark'
This way, whenever I invoke pyspark, a nice IPython console is started.
In addition, pysparknb starts a Jupyter notebook server in the current directory.
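For example, once the notebook is up, a cell like the following should work (a small sketch; in Spark 2.x the shell also exposes a SparkSession named spark alongside sc):
df = spark.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "name"])
df.show()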