- First, download Spark. I downloaded the pre-built binaries for Hadoop 2.7 (spark-2.1.0-bin-hadoop2.7).
- You'll also need to install the JDK; I took it from here.
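If you prefer doing this from the terminal, the download could look roughly like the following; the mirror URL is an assumption (take the actual link from the Spark downloads page), and the first command just confirms a JDK is present:

```bash
# Confirm a JDK is installed and on the PATH
java -version

# Example download of the pre-built package (adjust URL/version as needed)
cd ~/Downloads
curl -O https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
```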
Set up and test
Installing Spark is simple: all you have to do is extract the archive. I placed the extracted binaries in ~/Applications, resulting in the following tree structure:

```
~/Applications
└── spark-2.1.0-bin-hadoop2.7
```
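For reference, the extraction boils down to something like this (the archive name is assumed to match the version shown above):

```bash
mkdir -p ~/Applications
tar -xzf ~/Downloads/spark-2.1.0-bin-hadoop2.7.tgz -C ~/Applications
```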
At this point, you can already run Spark. To simplify the settings, I created a symbolic link ~/Applications/spark pointing to the versioned directory, using ln -s spark-2.1.0-bin-hadoop2.7 spark, which yields the following structure:

```
~/Applications
├── spark -> spark-2.1.0-bin-hadoop2.7
└── spark-2.1.0-bin-hadoop2.7
```
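A quick way to verify the link points where you expect:

```bash
# Should print: spark-2.1.0-bin-hadoop2.7
readlink ~/Applications/spark
```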
The next step is to set environment variables. In particular, I added the following lines to my shell profile:

```bash
export SPARK_HOME="/Users/user/Applications/spark"
export PYSPARK_SUBMIT_ARGS="--master local"

# Make pyspark available anywhere
export PATH="$SPARK_HOME/bin:$PATH"
```
spark-shell (i.e. the Scala-based REPL) should also be accessible at this point.
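A quick sanity check, assuming a shell in which the variables above are already set:

```bash
echo $SPARK_HOME           # /Users/user/Applications/spark
which spark-shell pyspark  # both should resolve to $SPARK_HOME/bin/...
spark-submit --version     # prints the Spark 2.1.0 banner
```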
Now it is time for testing. Start a new terminal session (or source your shell profile), and simply run pyspark. You should get a Python REPL with the SparkContext already loaded as sc. Create a simple RDD to verify everything works:

```python
x = sc.parallelize([1, 2, 3])
x.collect()  # should return [1, 2, 3]
```
The standard Python REPL is rather limited. You probably want to use IPython instead, or even better, Jupyter notebooks.
To that end, I added the following to my shell profile:

```bash
# Point PYSPARK_PYTHON at the Python executable (not just the Anaconda root)
export PYSPARK_PYTHON=/Users/drorata/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/Users/drorata/anaconda3/bin/ipython

# Start pyspark with a Jupyter notebook server as the driver
alias pysparknb='PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark'
```
This way, whenever I invoke pyspark, a nice IPython console is started.
pysparknb starts a Jupyter server in the current directory.
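For example (the directory below is just a placeholder):

```bash
cd ~/projects/my-analysis   # hypothetical working directory
pysparknb                   # launches a Jupyter notebook server; sc is available in new notebooks
```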