Thursday, October 27, 2016

How to find Pentaho Server [BA or DI] version?


Here are a few ways to find it:

1 - Visually inspect the file system on the server; the Kettle jar file names carry the version (e.g. kettle-core-<version>.jar).
# DI Server
 ls -l /home/pentaho/data-integration-server/tomcat/webapps/pentaho-di/WEB-INF/lib/kettle*
# BA Server
 ls -l /home/pentaho/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib/kettle*


2 - Transformation way - "Call endpoint" step
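The "Call endpoint" step issues a request against the Pentaho Server's REST API from within a transformation. Outside of Spoon, the same check can be made over plain HTTP; a minimal sketch with curl, assuming your release exposes the /api/version/show endpoint and uses the default ports, webapp names, and demo admin credentials (adjust all of these for your install):

 # BA Server
 curl -u admin:password http://localhost:8080/pentaho/api/version/show
 # DI Server
 curl -u admin:password http://localhost:9080/pentaho-di/api/version/show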

Saturday, September 24, 2016

Fun working with PDI - Pentaho Data Integration "Project.Properties" !!


It's always fun working with PDI, especially when you sit down at the keyboard to build something cool!!

Here I will show how to define a database connection via a project-level properties file ["project.properties", e.g. foo.properties] instead of fetching that information from kettle.properties [familiar to those who know Pentaho Data Integration]. This is very useful, and a best practice in ETL development, when you work on multiple projects with many connections.
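For illustration, a foo.properties might carry entries like these (hypothetical key names and values; use whatever your project needs):

 # foo.properties - per-project connection settings
 DB_HOST=localhost
 DB_PORT=3306
 DB_NAME=foodmart
 DB_USER=pentaho_user
 DB_PASS=secret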

Directory Structure:

[Screenshot: project folders under C:\pentaho_projects]
Demo Job Snapshot: 

The job executes one transformation that sets the entries above as variables from foo.properties, and a second transformation that uses them to retrieve a database table. Sweet, aha :)

Once the "Set Variables" transformation has run, the variables can be used to define a database connection, and that connection is what "fetch employee info" uses.
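In the connection dialog, the fields then reference those variables with PDI's ${...} syntax, along these lines (a sketch; the field labels follow the generic JDBC connection dialog and the key names are the hypothetical ones above):

 Host Name:      ${DB_HOST}
 Port Number:    ${DB_PORT}
 Database Name:  ${DB_NAME}
 User Name:      ${DB_USER}
 Password:       ${DB_PASS}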

Fetch Employee Info Transformation using the "foo_Project" connection

Thursday, May 19, 2016

Sentiment analysis on Movie reviews - Just PDI - Pentaho Data Integration (Weka) !!


Sentiment Analysis: 

No longer just a buzzword, it's in practice at many organizations now!! Capturing sentiment from unstructured text data has become common over the last few years. In this blog the focus is more on the technology, using the Pentaho stack of tools. Pentaho's DI [aka Kettle, open source] has "Data Mining" components, which use the Weka knowledge flow, a scoring component, etc.

Here I will show how you can import unstructured data, build a machine learning model, and score against that model, all using Pentaho tools.

Movie Reviews Data Set:  http://www.cs.cornell.edu/people/pabo/movie-review-data/

Weka's TextDirectoryLoader: a command-line tool that imports a directory of text files into an ARFF file for analysis in Weka [model building etc.]. Pretty sleek [fast]!!
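A typical invocation, assuming weka.jar is on the classpath and the review polarity dataset is unpacked into txt_sentoken [whose pos/neg subfolders become the class labels]:

 java -cp weka.jar weka.core.converters.TextDirectoryLoader -dir txt_sentoken > movie_reviews.arff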


Once imported into ARFF, a single document looks like the instance below. [Textual data plus the tagged sentiment]
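Roughly like this (a sketch with made-up review text, not the exact screenshot; TextDirectoryLoader emits one string attribute for the document and a nominal class attribute named after the folders):

 @relation movie_reviews

 @attribute text string
 @attribute @@class@@ {neg,pos}

 @data
 'the movie was a delight from start to finish ...',pos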

Once in ARFF, you can open the file in the Weka "Explorer" to clean punctuation and engineer NLP features [n-grams, stop words, tf, idf etc.]. The StringToWordVector class has a bunch of options for NLP feature extraction: you can apply "term frequency" and "inverse document frequency" weighting, remove stop words, stem terms [e.g. treating "films" and "film" as one], and generate bi-grams, tri-grams etc.
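The same filtering can be sketched from the command line (the flag set shown is illustrative; check your Weka version's help output), lowercasing tokens, keeping about 1000 words, applying TF/IDF weighting, and tokenizing into uni- and bi-grams:

 java -cp weka.jar weka.filters.unsupervised.attribute.StringToWordVector \
   -L -W 1000 -T -I \
   -tokenizer "weka.core.tokenizers.NGramTokenizer -min 1 -max 2" \
   -i movie_reviews.arff -o movie_reviews_vectors.arff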

PDI (Kettle) Job:
The job has two transformations to execute; you design them in "Spoon":
1. Knowledge Flow [builds the model; you can choose Support Vector Machine, Decision Tree, Naive Bayes etc.]
2. Scoring Component [based on the model, supply a test set to find out how well the sentiment classification was done]



Knowledge Flow (To Build Model):
Once you open the knowledge flow model-building component, you can supply the path where the knowledge flow file is present. On the right side is the Weka knowledge flow.



Once you open the FilteredClassifier step within the Weka knowledge flow, you can see how it combines the classifier (i.e. the learner) and the filter (StringToWordVector). The last step in the knowledge flow, "Serialized Model Saver", saves the J48 model into a binary file, which will be used in scoring.
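The same combination can be sketched from the Weka command line (illustrative only, reading the ARFF directly rather than a KettleInject feed; FilteredClassifier applies the filter before training J48):

 java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
   -t movie_reviews.arff \
   -F "weka.filters.unsupervised.attribute.StringToWordVector -L -W 1000" \
   -W weka.classifiers.trees.J48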

PS: Here the source is a plain text file of movie reviews [unstructured reviews with the sentiment tagged]; no ARFF reader is used, since the KettleInject step feeds the knowledge flow directly.

Scoring:
Now you are ready to supply test data and use the model saved previously by the "Serialized Model Saver" step. The "Model" tab embeds the decision tree in this case.



Outcome:

Enterprise Need: 
So if you have unstructured data and are looking to classify it, all you need is Pentaho Data Integration [PDI = Kettle]. This is how you can build a model, score against it, classify/predict with it, and take action on it. Most of these tasks can be arranged in a workflow [a set of jobs], scheduled as needed, and the predicted data can then be displayed in a dashboard for analysis.


Hope you enjoyed this approach. Thank you!

Thursday, March 31, 2016

REST API via Kettle PDI and Python



As the use of API services keeps growing, here is a scenario from the developer toolbox: accessing restaurant information [name, phone] from the Locu service
https://locu.com/

Q: How do you get the API response via PDI [Pentaho Data Integration] and process it in ETL?

Before opening Spoon, check whether there is a developer site for the service. In this case there is: developer.locu.com. Create an account and retrieve your "API KEY". Once you have the key, you can use it to search for your city and inspect the results (pic on the left); in this case the service returns JSON [JavaScript Object Notation]. You can also find the request URL; if you open that URL in a browser, the JSON output is printed right in the browser, as shown below (right).




Once you see the JSON output in the browser, open Spoon and create your transformation as below. You can then use either a Text File Output step or a database table as the target step [not captured here]. The string operations are needed to substitute the city you supply via a transformation parameter; if the city name contains a space, as in "New York", it is replaced with "%20" so the URL parses properly.

Q. How to do this the Python way?

#!/usr/bin/env python
# Python 2 (urllib2 moved to urllib.request in Python 3)

import urllib2
import json

locu_api = 'YOUR_API'
# Example request URL:
# https://api.locu.com/v1_0/venue/search/?locality=Newport%20Beach&api_key=YOUR_API

def locu_search(query):
    api_key = locu_api
    url = 'https://api.locu.com/v1_0/venue/search/?api_key=' + api_key
    # URL-encode spaces in the city name: "New York" -> "New%20York"
    locality = query.replace(' ', '%20')
    final_url = url + "&locality=" + locality + "&category=restaurant"
    # Fetch and parse the JSON response
    json_obj = urllib2.urlopen(final_url)
    data = json.load(json_obj)
    # Print name and phone for each venue returned
    for item in data['objects']:
        print item['name'], item['phone']

if __name__ == '__main__':
    locu_search('New York')

##########Ends Here##################
Hope you liked it...have a great day!

Wednesday, March 16, 2016

Pentaho Data Integration (Kettle) with Weka - Predicting Diabetes [Pima]



This post shows how to use Pentaho's Data Integration [PDI or Kettle] together with Weka to build a predictive model for diabetes prediction.

I wrapped a couple of PDI transformations [TF] into the job [JB] below. In Pentaho's terms, a job is a workflow and a transformation is a data flow. The first TF builds a predictive model from a training file; the second TF scores unseen data against that model.


Tools:

PDI 5.x
Weka 3.7.11
Train File
Test File

jb_build_n_score

tf_build_J48Model
This transformation runs a Weka knowledge flow; the reference to the flow file "J48_ModelSaver.kfml" is set in the left window under "Load/import Knowledge Flow". Turn the "Inject data into KnowledgeFlow" check box ON and set the step name to KettleInject. You must have the required Weka package downloaded [note that no ARFF reader is used here - that is what makes the difference when running inside Kettle/PDI]. Set your "Class attribute" as displayed in the right-most window. When you click "Show embedded KnowledgeFlow editor", the third tab [screenshot here] opens.

Knowledge Flow shown within PDI:
You should first build this knowledge flow in Weka [via an ARFF input] and, once it works without error, change the input step to "KettleInject".

Scoring:
Once the model is generated, you can feed a test sample to the "Weka Scoring" component, which uses the model for prediction. The data is already labeled, and now you want to predict based on the model. In the window below you supply the path to the binary model created in the transformation above.

J48 tree under "Model" tab:



CSV file source [the header corresponds to the @attribute lines]: http://www2.mta.ac.il/~gideon/courses/data_mining/diabetes.arff 
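For reference, the header of that ARFF file declares the Pima attributes roughly as follows (abridged sketch):

 @relation diabetes
 @attribute preg numeric
 @attribute plas numeric
 @attribute pres numeric
 @attribute skin numeric
 @attribute insu numeric
 @attribute mass numeric
 @attribute pedi numeric
 @attribute age numeric
 @attribute class {tested_negative, tested_positive}
 @data
 6,148,72,35,0,33.6,0.627,50,tested_positive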

Prediction Phase: 
The scored file is generated by the "Predict -Diabetes" step in the scoring TF. Import it into Excel and use a comparison formula in column K; the red rows are incorrectly predicted. In this case accuracy = 70.96%.
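The same accuracy check can be scripted instead of done in Excel; a minimal sketch in Python 2, matching the style of the earlier post (the file name and column names are assumptions, so adjust them to your scored output):

#!/usr/bin/env python
# Hypothetical columns: 'class' = actual label, 'predicted_class' = Weka Scoring output
import csv

correct = total = 0
with open('scored_output.csv') as f:
    for row in csv.DictReader(f):
        total += 1
        if row['class'] == row['predicted_class']:
            correct += 1

print 'accuracy = %.2f%%' % (100.0 * correct / total)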

Hope you enjoyed the ride! 
Any questions, feel free to post a reply.

PS: Next in the series [PDI+Weka]: solving an unstructured text mining problem.