Wednesday, March 14, 2018

Pi Day via Spark and Pentaho Data Integration with Spark ML (Java & Python)!!

This blog shows how you can use Hitachi's PDI (Pentaho Data Integration) to submit Spark jobs for machine learning using Java and Python libraries. Pi is also calculated via a Spark submit; a task for you is to locate where :)


The snapshots below use Spark's Java and Python machine learning algorithms.
OS used: Ubuntu 16.04
Tool: Spoon 7.1 a.k.a Pentaho Data Integration (Design Tool)

Install Spark:
tar zxvf spark-2.1.0-bin-hadoop2.7.tgz
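For completeness, a sketch of fetching and unpacking the same build; the archive URL and install location are assumptions, so adjust to your mirror and preferred directory:

```shell
# Download Spark 2.1.0 pre-built for Hadoop 2.7 (any mirror carrying this
# release works) and unpack it.
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
tar zxvf spark-2.1.0-bin-hadoop2.7.tgz

# Point SPARK_HOME at the unpacked directory so later commands can refer to it.
export SPARK_HOME="$PWD/spark-2.1.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
```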

Start Master and Slave:
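The screenshots originally showed this step; on the command line, the standalone scripts that ship with Spark do the same thing (hostname and port below are the defaults, assumed for a single-box setup):

```shell
# Start the standalone master; its web UI comes up on port 8080 by default.
$SPARK_HOME/sbin/start-master.sh

# Start a worker (slave) and register it with the master.
# spark://localhost:7077 assumes master and worker run on the same machine.
$SPARK_HOME/sbin/start-slave.sh spark://localhost:7077
```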

Browse to port 8080 to check that the Spark master UI is up:
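Besides the browser, a quick headless check (localhost assumed) can confirm the master UI is answering:

```shell
# The master UI serves HTML on 8080; an HTTP 200 means it is up.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080
```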

Launch PDI Job (Spark - Java ML Library):
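PDI's Spark Submit job entry ultimately drives Spark's own spark-submit launcher, so a roughly equivalent command line (the master URL and example class are assumptions based on the standalone setup above) looks like this:

```shell
# SparkPi, from the examples jar bundled with Spark 2.1.0, estimates pi by
# Monte Carlo sampling -- fitting for Pi Day. The trailing 100 is the number
# of partitions to sample over.
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 100
```

The same entry can point at any Java ML class on the classpath, for example the bundled org.apache.spark.examples.ml.JavaKMeansExample.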

Submitting a Spark Python ML job via PDI:

Install Python Libraries - 
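The exact libraries were shown in the screenshots; as a sketch for Ubuntu 16.04, NumPy is the hard requirement for PySpark's MLlib, while scikit-learn and pandas are assumptions for the KNN runs and the follow-up analytics:

```shell
# pip itself, then the scientific Python stack (package choices assumed).
sudo apt-get install -y python-pip
pip install numpy scikit-learn pandas
```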

PDI job to submit the Spark Python script (entry shown), running K-Nearest Neighbors with various seed values, plus follow-up analytics:
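A hypothetical command-line equivalent of what the PDI entry issues for this job; knn.py is a placeholder name for the K-Nearest Neighbors script, and passing the seed as an argument is an assumption about how the job loops over seed values:

```shell
# Submit the Python KNN script once per seed value; PDI's job loop plays
# the role of this for-loop when driven from Spoon.
for SEED in 7 42 2018; do
  $SPARK_HOME/bin/spark-submit \
    --master spark://localhost:7077 \
    knn.py --seed "$SEED"
done
```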