Wednesday, May 30, 2018

Pentaho Data Integration (PDI) - Plugin Machine Intelligence: Supervise the supervised learning!


Hello!!

Recently I was going over PDI 8.1 (Spoon client) and was excited to see an expanded "Data Mining" directory. It has individual PDI steps for algorithms such as support vector machines, decision trees, Naive Bayes, logistic regression, boosting, and a few more. These algorithms are part of Plugin Machine Intelligence (PMI), which is built on 4 core engines (Weka, Python, R, Spark) and their associated machine learning libraries (Weka, scikit-learn, MLR, MLlib), respectively. PMI can be downloaded from the PDI Marketplace.




With PMI, development and testing time decreases drastically while still adhering to CRISP-DM (Cross-Industry Standard Process for Data Mining). Below are the usual stages of any data science project.

1. Engineer Features 
2. Train Model 
3. Test Model 
4. Deploy Model (Repeat the process periodically)

With PDI (aka Spoon) in hand, you can now work through the steps above with a faster feedback loop when addressing a data science problem. I will show a demo example and how you can use these new steps within Spoon.

Plugin Machine Intelligence:

Below is the list of algorithms on offer. The job on the right works on the PIMA diabetes dataset, downloadable from Weka and saved to the local file system as CSV (I also prepared a separate file for testing). "Train - via KFML Model" is described in one of my older blog posts (please refer to it). Here I will focus on the PMI transformations "Train - via PMI Classifier" and "Score Model".


Classifiers:
Once you open any of the PMI-based classifiers, you get 5 tabs [Configure, Fields, Algorithm config, Preprocessing, Evaluation]. A couple of those tabs are shown here; you can also parameterize the directory where the models are stored.

You can use the Evaluation tab to pick the evaluation type (CV, Percent Split, Separate Test Set); here I used 10-fold CV across all algorithms.
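To make the 10-fold CV option concrete, here is a minimal pure-Python sketch of how rows can be partitioned into folds. It is not PMI's or Weka's implementation; the function name `k_fold_indices`, the seed, and the row count of 768 (the commonly cited size of the Pima diabetes dataset) are assumptions for illustration only.

```python
import random


def k_fold_indices(n_rows, k=10, seed=42):
    """Return a list of (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; PMI handles this internally.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    indices = list(range(n_rows))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal rows round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [row for j, fold in enumerate(folds) if j != i for row in fold]
        splits.append((train_idx, test_idx))
    return splits


# Assumed row count; adjust to the size of your own CSV.
splits = k_fold_indices(n_rows=768, k=10)
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: train={len(train_idx)} test={len(test_idx)}")
```

Each row lands in exactly one test fold, so every record is used for evaluation once and for training in the other nine rounds.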

Scoring:
Once the models are created, you can use them for prediction on test data through the "PMI Scoring" component. In one go, you can score with models from multiple algorithms!
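The idea of scoring the same test rows with several models in one pass can be sketched in plain Python as below. This is only an illustration of the pattern, not how "PMI Scoring" works internally: the two threshold rules stand in for trained models, and the field names (`plas`, `mass`, `class`) and label values are assumptions about the CSV layout.

```python
def score_rows(rows, models):
    """Return copies of the input rows with one prediction column per model."""
    scored = []
    for row in rows:
        out = dict(row)  # copy so the input rows stay untouched
        for name, predict in models.items():
            out[f"pred_{name}"] = predict(row)
        scored.append(out)
    return scored


# Hypothetical stand-ins for trained models: simple threshold rules.
models = {
    "rule_a": lambda r: "tested_positive" if r["plas"] >= 140 else "tested_negative",
    "rule_b": lambda r: "tested_positive" if r["mass"] >= 35 else "tested_negative",
}

# Made-up test rows, for illustration only.
rows = [
    {"plas": 150, "mass": 30.0, "class": "tested_positive"},
    {"plas": 100, "mass": 36.5, "class": "tested_negative"},
]

scored = score_rows(rows, models)
for row in scored:
    print(row)
```

Keeping one prediction column per model next to the actual class makes it easy to load the output into a database and compare algorithms side by side.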

Analysis:

Once you have collected the results into a database, you can use Pentaho Business Analytics to visualize them in Pentaho Analyzer (PAZ) reports.

Hits: SVM clearly matched the most records, making it the winner among these algorithms on this data.


Misses: SVM missed just one. The J48 decision tree is next in line, though it missed a few (this was with default parameters). Both Naive Bayes and logistic regression misclassified a bunch of records in this data.
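The hit and miss counts behind such a report can be computed with a few lines of Python once each row carries the actual class and one prediction per algorithm. The sketch below uses made-up rows and hypothetical column names (`pred_svm`, `pred_nb`); it does not reproduce the numbers from my experiment.

```python
from collections import Counter


def hits_and_misses(rows, actual_field, prediction_fields):
    """Count correct (hit) and incorrect (miss) predictions per prediction column."""
    summary = {}
    for field in prediction_fields:
        counts = Counter(
            "hit" if row[field] == row[actual_field] else "miss" for row in rows
        )
        summary[field] = {"hit": counts["hit"], "miss": counts["miss"]}
    return summary


# Made-up scored rows, for illustration only.
scored_rows = [
    {"class": "pos", "pred_svm": "pos", "pred_nb": "neg"},
    {"class": "neg", "pred_svm": "neg", "pred_nb": "neg"},
    {"class": "pos", "pred_svm": "pos", "pred_nb": "pos"},
    {"class": "neg", "pred_svm": "pos", "pred_nb": "pos"},
]

summary = hits_and_misses(scored_rows, "class", ["pred_svm", "pred_nb"])
for model, counts in summary.items():
    print(model, counts)
```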


I then tried setting the useSupervisedDiscretization parameter for NaiveBayes under the "Algorithm Config" tab and re-ran the experiment. This time the misses decreased (see the NB bar below)!
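For intuition on why supervised discretization can help Naive Bayes, here is a simplified sketch: it uses the class labels to pick the single cut point on a numeric attribute that minimizes the weighted class entropy. This is a deliberately reduced, one-split version for illustration, not Weka's actual discretization method, and the sample values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return the cut point with the lowest weighted class entropy, or None."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = (low + high) / 2, score
    return best_cut


# Made-up numeric attribute and class labels, for illustration only.
values = [90, 100, 110, 150, 160, 170]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
print(best_cut_point(values, labels))
```

Because the bins are chosen with the class in mind, the resulting categories tend to separate the classes better than a Gaussian assumption on the raw numeric values.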





From the file system perspective:




Conclusion:

With Plugin Machine Intelligence (PMI) for PDI, you can easily download and use your favorite algorithms to train, test, and build models while keeping execution fast and adhering to CRISP-DM practice!