Thursday, September 13, 2018

Estimating Pi ~3.141.. using PySpark via Jupyter!!


Hello... The following steps will get a Spark cluster set up and let you use the "pyspark" Spark application via Jupyter!!
  • Download and Install Spark:
         Spark can be downloaded from https://spark.apache.org/downloads.html
         $ tar zxvf [Spark tar ball]
  • Start Master and Worker nodes  
         cd ~/spark-2.3.1-bin-hadoop2.7/sbin/
         ./start-master.sh
         ./start-slave.sh spark://ubuntu:7077

         Open a browser to check that the Spark UI is up, as below
         http://IP_SPARK_CLUSTER:8080/

  • Install and set up Jupyter (assuming you have pip installed)
          sudo apt-get -y install ipython ipython-notebook
          sudo -H pip install --upgrade pip
          sudo -H pip install jupyter
  • Update ~/.bashrc with ENV variables for Jupyter        
          export PYSPARK_DRIVER_PYTHON=jupyter
          export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

          SPARK_HOME is expected as well (and make sure $SPARK_HOME/bin is on your PATH so the pyspark command below resolves).
          export SPARK_HOME=/PATH_TO/spark-2.3.1-bin-hadoop2.7
  • Launch pyspark against the cluster (with the variables above, this opens a Jupyter notebook)
          pyspark --master spark://ubuntu:7077
  • Now let's estimate Pi in the notebook (a Python 3 variant of this snippet follows after this list):
          import pyspark
          import random

          # reuse the SparkContext created by the pyspark driver if it already exists
          if 'sc' not in globals():
              sc = pyspark.SparkContext()

          NUM_SAMPLES = 100000000

          # a point counts as a hit if it falls inside the unit quarter circle
          def sample(p):
              x, y = random.random(), random.random()
              return 1 if x*x + y*y < 1 else 0

          count = sc.parallelize(xrange(0, NUM_SAMPLES)) \
                    .map(sample) \
                    .reduce(lambda a, b: a + b)

          print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


Quick Demo:


Wednesday, May 30, 2018

PDI (Pentaho Data Integration) - Plugin Machine Intelligence... Supervise the supervised learning!!!


Hello!!

Recently I was going over PDI 8.1 (the Spoon client) and was excited to see an expanded "Data Mining" directory. It has individual PDI steps for algorithms such as Support Vector Machines, Decision Trees, Naive Bayes, Logistic Regression, Boosting, and a few more. These algorithms are part of the Plugin Machine Intelligence (PMI), which is built on 4 core engines (Weka, Python, R, Spark) with their associated machine learning libraries (Weka, Scikit-Learn, MLR, MLlib) respectively. PMI can be downloaded from the PDI Marketplace.




Development and testing time has drastically decreased, while still adhering to CRISP-DM [Cross-Industry Standard Process for Data Mining]. Below is the typical cycle in any data science project:

1. Engineer Features 
2. Train Model 
3. Test Model 
4. Deploy Model (Repeat the process periodically)

With PDI (a.k.a. Spoon) in hand, you can now work through the steps above with a much faster feedback loop when addressing a data science problem. Below is a demo example showing how you can use these new steps within Spoon.

Plugin Machine Intelligence:

Below is the list of algorithms on offer. The job on the right side works on the PIMA diabetes dataset, downloaded from Weka to the local file system as CSV (a separate file was also prepared for testing). "Train - via KFML Model" is described in one of my older blogs (please refer to it). Here the focus is on the PMI "Train - via PMI Classifier" and "Score Model" transformations.


Classifiers:
Once you open any of the PMI-based classifier steps, you will see 5 tabs [Configure, Fields, Algorithm config, Preprocessing, Evaluation]. A couple of those tabs are shown below; you can parameterize the directory where the models are stored.

You may use the Evaluation tab to pick the evaluation type (CV, Percent Split, Separate Test set); here, 10-fold CV was used across all algorithms.

Scoring:
Once the models are created, you can use them for prediction on test data through the "PMI Scoring" component. In one go, you can score with models from multiple algorithms!!
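
PMI drives these engines from the Spoon UI, so no code is needed, but as a rough code-level analogue of the setup above, here is a minimal scikit-learn sketch (scikit-learn being one of PMI's underlying libraries) that evaluates the same four classifiers with 10-fold CV on the PIMA diabetes CSV. The file name diabetes.csv and the column layout (class label in the last column) are assumptions; adjust them to the file you downloaded from Weka.

      # Rough scikit-learn analogue of the PMI classifier comparison:
      # SVM, decision tree (J48-like), Naive Bayes and Logistic Regression,
      # each evaluated with 10-fold cross-validation, as in the Evaluation tab.
      import pandas as pd
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.naive_bayes import GaussianNB
      from sklearn.linear_model import LogisticRegression

      data = pd.read_csv("diabetes.csv")           # assumed file name
      X, y = data.iloc[:, :-1], data.iloc[:, -1]   # assumed: last column = class

      models = {
          "SVM": SVC(),
          "DecisionTree": DecisionTreeClassifier(),
          "NaiveBayes": GaussianNB(),
          "LogisticRegression": LogisticRegression(max_iter=1000),
      }

      for name, model in models.items():
          scores = cross_val_score(model, X, y, cv=10)
          print("%-20s accuracy: %.3f" % (name, scores.mean()))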

Analysis:

Once you have collected the results into a database, you can use Pentaho Business Analytics to visualize them as PAZ (Pentaho Analyzer) reports.
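
As a minimal sketch of that collection step (SQLite and the table/column names are my assumptions, and the few rows are dummy values purely for illustration), the scored output per algorithm could be appended to a table that the report then aggregates:

      # Append scored rows to a database table so hits/misses per algorithm
      # can be reported later (the blog uses Pentaho Analyzer for that part).
      import sqlite3
      import pandas as pd

      results = pd.DataFrame({                      # dummy rows, illustration only
          "algorithm": ["SVM", "J48", "NaiveBayes", "LogisticRegression"],
          "actual":    ["tested_positive"] * 4,
          "predicted": ["tested_positive", "tested_positive",
                        "tested_negative", "tested_negative"],
      })
      results["hit"] = (results["actual"] == results["predicted"]).astype(int)

      with sqlite3.connect("pmi_scores.db") as conn:
          results.to_sql("scored_results", conn, if_exists="append", index=False)
          print(pd.read_sql("SELECT algorithm, SUM(hit) AS hits, "
                            "COUNT(*) - SUM(hit) AS misses "
                            "FROM scored_results GROUP BY algorithm", conn))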

Hits: It's clear that SVM matched the most records (the winner) among these algorithms on this data.


Misses: SVM missed just one, and the J48 decision tree is next in line, though it missed a few (this was with default parameters). Both Naive Bayes and Logistic Regression missed a bunch on this data.


I then enabled the useSupervisedDiscretization parameter for NaiveBayes under the "Algorithm config" tab and re-ran the experiment; this time it decreased the misses (see the NB bar below)!
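
Weka's useSupervisedDiscretization bins the numeric attributes (guided by the class) before Naive Bayes is fit. As a loose analogue in scikit-learn terms, the sketch below compares Naive Bayes on the raw features against Naive Bayes on binned features; KBinsDiscretizer and CategoricalNB are my choices for illustration (unsupervised binning), not what PMI actually runs.

      # Loose analogue of "discretize, then Naive Bayes": compare plain
      # GaussianNB on raw features with NB fit on binned (discretized) features.
      import pandas as pd
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import KBinsDiscretizer
      from sklearn.naive_bayes import GaussianNB, CategoricalNB

      data = pd.read_csv("diabetes.csv")           # assumed file name / layout
      X, y = data.iloc[:, :-1], data.iloc[:, -1]

      nb_binned = make_pipeline(
          KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"),
          CategoricalNB(),
      )

      print("NB, raw features:    %.3f" % cross_val_score(GaussianNB(), X, y, cv=10).mean())
      print("NB, binned features: %.3f" % cross_val_score(nb_binned, X, y, cv=10).mean())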





From a file system perspective:




Conclusion:

With Plugin Machine Intelligence (PMI) for PDI, you can easily download and use your favorite algorithms to train/test/build models while keeping execution fast and adhering to CRISP-DM practice!!

Wednesday, March 14, 2018

Pi day via Spark and Pentaho Data Integration with Spark ML (Java & Python)!!

This blog shows how you can use Hitachi's PDI [Pentaho Data Integration] to submit Spark jobs for machine learning using Java and Python libraries. Pi is also calculated via a Spark submit; the task for you is to locate where :)


The snapshots below use Spark's Java and Python machine learning algorithms.
OS used: Ubuntu 16.04
Tool: Spoon 7.1 a.k.a. Pentaho Data Integration (Design Tool)

Install Spark:
tar zxvf spark-2.1.0-bin-hadoop2.7.tgz

Start Master and Slave:










Browse to check that the Spark master UI is up on port 8080.










Launch PDI Job (Spark - Java ML Library):












Submitting a Spark Python ML job via PDI:

Install Python Libraries - 











PDI job to submit Spark-Python (entry shown), with a K-Nearest Neighbor run for various seed values and the resulting analytics:
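
The PDI entry itself is configured in Spoon, but the Python script such an entry submits might look roughly like the sketch below: a K-Nearest Neighbor classifier evaluated over several seed values for the train/test split. scikit-learn is assumed here purely for illustration (Spark MLlib has no built-in KNN classifier), and the input file name, column layout, seed list, and k value are assumptions as well.

      # Hypothetical sketch of a KNN run repeated for various seed values:
      # each seed changes the train/test split, showing how stable the accuracy is.
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.neighbors import KNeighborsClassifier

      data = pd.read_csv("diabetes.csv")            # assumed input file
      X, y = data.iloc[:, :-1], data.iloc[:, -1]    # assumed: last column = class

      for seed in (1, 7, 42, 99, 123):              # the "various seed values"
          X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                    random_state=seed)
          knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
          print("seed=%3d  accuracy=%.3f" % (seed, knn.score(X_te, y_te)))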