Hello! The following steps will set up a standalone Spark cluster and let you drive a "pyspark" Spark application from Jupyter!
- Download and Install Spark:
Spark can be downloaded from https://spark.apache.org/downloads.html
$ tar zxvf [Spark tarball]
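For example, to fetch the 2.3.1 / Hadoop 2.7 build used in the rest of this post (any mirror from the downloads page works; the archive URL below is one option):
$ wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
$ tar zxvf spark-2.3.1-bin-hadoop2.7.tgz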
- Start Master and Worker nodes
cd ~/spark-2.3.1-bin-hadoop2.7/sbin/
./start-master.sh
./start-slave.sh spark://ubuntu:7077
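Note that start-slave.sh takes the master URL as its argument (here the master happens to run on a host named "ubuntu"); on a real multi-node cluster you would run it once on each worker machine. A quick sanity check that both daemons came up, assuming a JDK with jps on the PATH:
$ jps
The output should include a Master and a Worker process.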
Open a browser and check that the Spark master UI is up:
http://IP_SPARK_CLUSTER:8080/
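The worker should be listed under "Workers" with state ALIVE. If you prefer the terminal, the master UI also exposes the same cluster state as JSON:
$ curl http://IP_SPARK_CLUSTER:8080/json/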
- Install and set up Jupyter (assuming you have pip installed)
sudo apt-get -y install ipython ipython-notebook
sudo -H pip install --upgrade pip
sudo -H pip install jupyter
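To confirm the installation:
$ jupyter --version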
- Update ~/.bashrc with environment variables for Jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
SPARK_HOME must be set as well, so the pyspark launcher knows where Spark lives:
export SPARK_HOME=/PATH_TO/spark-2.3.1-bin-hadoop2.7
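Reload the file so your current shell picks the variables up. Adding $SPARK_HOME/bin to PATH is optional, but lets you run pyspark from any directory:
export PATH=$SPARK_HOME/bin:$PATH
$ source ~/.bashrc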
- Launch Jupyter (and the Spark application UI) via pyspark
pyspark --master spark://ubuntu:7077
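Because PYSPARK_DRIVER_PYTHON points at Jupyter, this command starts a notebook server (by default at http://localhost:8888/) instead of the plain pyspark shell, with a SparkContext already connected to the cluster. In a fresh notebook you can sanity-check the connection; sc is predefined by pyspark's startup script:

print(sc.master)   # should print spark://ubuntu:7077
print(sc.version)  # should print 2.3.1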
- Now let's estimate Pi
This is the classic Monte Carlo example: sample random points in the unit square, and the fraction that falls inside the quarter circle approaches Pi/4.
import pyspark
import random

# pyspark already provides sc in the notebook; create one only if it's missing
if 'sc' not in globals():
    sc = pyspark.SparkContext()

NUM_SAMPLES = 100000000

def sample(_):
    # Throw a dart at the unit square; score 1 if it lands inside the quarter circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(range(0, NUM_SAMPLES)) \
          .map(sample) \
          .reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
Quick Demo: