Thursday, September 13, 2018

Estimating Pi ~3.141.. using PySpark via Jupyter!!


Hello... The following steps will get a Spark standalone cluster set up and let you use the "pyspark" Spark application via Jupyter!!
  • Download and Install Spark:
         Spark can be downloaded from https://spark.apache.org/downloads.html 
         $ tar zxvf [Spark tar ball] 
  • Start Master and Worker nodes  
         cd ~/spark-2.3.1-bin-hadoop2.7/sbin/
         ./start-master.sh
         ./start-slave.sh spark://ubuntu:7077

         Open a browser to check that the Spark master UI is up: 
         http://IP_SPARK_CLUSTER:8080/

  • Install and set up Jupyter (assuming you have pip installed)
           sudo apt-get -y install ipython ipython-notebook
           sudo -H pip install --upgrade pip  
           sudo -H pip install jupyter 
  • Update ~/.bashrc with ENV variables for Jupyter        
          export PYSPARK_DRIVER_PYTHON=jupyter
          export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

          SPARK_HOME is expected as well; run "source ~/.bashrc" afterwards so the changes take effect.
          export SPARK_HOME=/PATH_TO/spark-2.3.1-bin-hadoop2.7
  • Launch PySpark (with the variables above, this opens a Jupyter notebook in your browser)
          pyspark --master spark://ubuntu:7077

          A quick sanity check of the cluster connection is sketched after the Pi example below.
  • Now let's estimate Pi. The idea is simple Monte Carlo sampling: draw random points in the unit square; the fraction that lands inside the quarter circle of radius 1 approaches Pi/4, so multiplying that fraction by 4 gives an estimate of Pi (the more samples, the tighter the estimate).
          import pyspark
          import random

          # Reuse the SparkContext that pyspark already created, if there is one
          if 'sc' not in globals():
              sc = pyspark.SparkContext()

          NUM_SAMPLES = 100000000

          def sample(p):
              # Random point in the unit square; 1 if it falls inside the quarter circle
              x, y = random.random(), random.random()
              return 1 if x*x + y*y < 1 else 0

          # On Python 2, use xrange here to avoid materializing a 100M-element list
          count = sc.parallelize(range(0, NUM_SAMPLES)) \
                    .map(sample) \
                    .reduce(lambda a, b: a + b)

          print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))


Quick Demo: