Thursday, March 31, 2016

REST API via Kettle PDI and Python



As more and more API services emerge, here is a scenario from the developer toolbox: retrieving restaurant information [name, phone] from the Locu service
https://locu.com/

Q: How to get API response via PDI [Pentaho Data Integration] and process ETL?

Before opening Spoon, check whether there is a developer site for the service. In this case there is one: developer.locu.com. Create an account and retrieve your "API KEY". Once you have the API key, you can use it to search your city and look at the results (pic on left); in this case it gives you a JSON [JavaScript Object Notation] result. You can also find the URL; if you paste that URL into a browser, the JSON output is printed right in the browser, as shown below (right).
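To get a feel for that JSON output before touching Spoon, here is a minimal sketch of parsing a Locu-style payload in Python. The sample payload, the venue names, and the phone numbers are made-up placeholders; only the field names (`objects`, `name`, `phone`) are taken from the search script further below.

```python
import json

# Hypothetical, trimmed-down payload shaped like a venue search response
sample = """
{
  "objects": [
    {"name": "Ocean Grill", "phone": "(949) 555-0101"},
    {"name": "Harbor Cafe", "phone": "(949) 555-0102"}
  ]
}
"""

data = json.loads(sample)
# Each entry under "objects" is one venue; pull out name and phone
for venue in data["objects"]:
    print(venue["name"], venue["phone"])
```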




Once you see the JSON output in the browser, open Spoon and create your transformation as below. You can then use either a Text File Output step, or move the data into a database as the target step [not captured here]. The string operations are needed to substitute the city you supply via a defined transformation parameter. If there is a space between two words in the city name, e.g. "New York", it is replaced with "%20" so the URL parses properly.
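That space-to-%20 substitution is ordinary URL percent-encoding; the same idea sketched in Python (the `quote` call is the more general Python 3 form, not part of the Kettle transformation):

```python
from urllib.parse import quote  # Python 3; the script below uses Python 2's urllib2

city = "New York"

# Simple substitution, exactly what the string-replace step in the
# transformation does:
encoded = city.replace(" ", "%20")
print(encoded)  # New%20York

# More general percent-encoding, which also handles other unsafe characters:
print(quote(city))  # New%20York
```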
















Q: How to do this the Python way?

#!/usr/bin/env python
# Python 2 script; in Python 3, urllib2 became urllib.request.

import urllib2
import json

locu_api = 'YOUR_API'
#url = 'https://api.locu.com/v1_0/venue/search/?locality=Newport%20Beach&api_key=YOUR_API'

def locu_search(query):
    api_key = locu_api
    url = 'https://api.locu.com/v1_0/venue/search/?api_key=' + api_key
    # Encode spaces so multi-word cities ("New York") form a valid URL
    locality = query.replace(' ', '%20')
    final_url = url + '&locality=' + locality + '&category=restaurant'
    # Fetch the response and decode the JSON payload
    json_obj = urllib2.urlopen(final_url)
    data = json.load(json_obj)
    # Print name and phone for every venue returned
    for item in data['objects']:
        print item['name'], item['phone']

locu_search('Newport Beach')
##########Ends Here##################
Hope you liked it...have a great day!

Wednesday, March 16, 2016

Pentaho Data Integration (Kettle) with Weka - Predicting Diabetes [Pima]



This blog post walks through using Pentaho Data Integration [PDI, or Kettle] together with Weka to build a predictive model for diabetes prediction.

A couple of PDI transformations [TF] are wrapped into the job [JB] below. In Pentaho's terminology, a job is a workflow and a transformation is a data flow. The first TF builds a predictive model from a training file. The second TF scores unseen data against that model.
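The same build-then-score pattern can be sketched in plain Python, independent of PDI and Weka: a toy model is "trained" on labeled rows, persisted to disk (the role Weka's model saver plays here), and later reloaded to score unseen values. The one-rule classifier is an illustrative stand-in, not J48.

```python
import pickle

class OneRuleModel(object):
    """Toy stand-in for a learned model: thresholds a single feature."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, value):
        return "tested_positive" if value >= self.threshold else "tested_negative"

# --- build TF: train on labeled data and save the model to disk ---
train = [(90, "tested_negative"), (100, "tested_negative"),
         (160, "tested_positive"), (180, "tested_positive")]
neg = [v for v, label in train if label == "tested_negative"]
pos = [v for v, label in train if label == "tested_positive"]
# Crude "training": threshold at the midpoint between the class means
model = OneRuleModel((sum(neg) / float(len(neg)) + sum(pos) / float(len(pos))) / 2.0)
with open("model.bin", "wb") as f:
    pickle.dump(model, f)

# --- scoring TF: reload the saved model and score unseen data ---
with open("model.bin", "rb") as f:
    scorer = pickle.load(f)
for value in [85, 170]:
    print(value, scorer.predict(value))
```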


Tools:

PDI 5.x
Weka 3.7.11
Train File
Test File

jb_build_n_score






tf_build_J48Model
This transformation embeds a Weka KnowledgeFlow; the reference to that flow file, "J48_ModelSaver.kfml", appears in the left window under "Load/import Knowledge Flow". Tick the "Inject data into KnowledgeFlow" check box and set the step name to KettleInject. You must have downloaded the corresponding package into your Weka installation [note that an ARFF reader is not used here; that is the difference when the data comes from Kettle/PDI]. Set your "Class attribute" as displayed in the right-most window. When you click "Show embedded KnowledgeFlow editor", a third tab opens [screenshot here].










Knowledge Flow shows within PDI:
You should first build this KnowledgeFlow in Weka itself [with an ARFF input], and only after it works without error change the input to "KettleInject".





Scoring:
Once the model is generated, you can supply a test sample to the "Weka Scoring" step, which uses the model for prediction. Your data is already labeled; now you want to predict based on the model. In the window below you supply the path to the binary model created in the transformation above.





J48 tree under "Model" tab:



CSV file source [header = the @attribute lines]: http://www2.mta.ac.il/~gideon/courses/data_mining/diabetes.arff
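Since the source is an ARFF file, its @attribute lines can be turned into the CSV header row. A minimal sketch of that extraction; the header snippet below is a hypothetical fragment, the real attribute names come from the diabetes.arff file linked above.

```python
# Hypothetical fragment of an ARFF header, trimmed for illustration
arff_header = """\
@relation diabetes
@attribute preg numeric
@attribute plas numeric
@attribute class {tested_negative, tested_positive}
@data
"""

# The second token of each @attribute line is the column name
columns = [line.split()[1]
           for line in arff_header.splitlines()
           if line.lower().startswith("@attribute")]
print(",".join(columns))  # preg,plas,class
```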

Prediction Phase: 
The file is generated by the "Predict-Diabetes" step in the scoring TF. Importing it into Excel and applying a comparison function on Col-K marks the incorrectly predicted rows in red. In this case, accuracy = 70.96%.
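The same accuracy check can be done without Excel; a short sketch comparing actual labels against predicted ones. The two sample lists are made up to show the arithmetic, not taken from the actual scoring output.

```python
actual    = ["tested_positive", "tested_negative", "tested_negative", "tested_positive"]
predicted = ["tested_positive", "tested_negative", "tested_positive", "tested_positive"]

# Count matching rows and divide by the total, as the Col-K formula does
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = 100.0 * correct / len(actual)
print("Accuracy = %.2f%%" % accuracy)  # Accuracy = 75.00%
```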





Hope you enjoyed the ride! 
Any question, feel free to post a reply.

PS: Next in the series [PDI+Weka]: solving an unstructured text-mining problem...