Thursday, May 19, 2016

Sentiment Analysis on Movie Reviews - Just PDI - Pentaho Data Integration (Weka) !!


Sentiment Analysis: 

No longer just a buzzword, it's in practice at many organizations now!! Capturing sentiment from unstructured text data has become common practice over the last few years. In this blog the focus is on the technology, using the Pentaho stack of tools. Pentaho's DI [aka Kettle, open source] has "Data Mining" components, which use Weka's knowledge flow, scoring component, etc.

Here I will show how you can import unstructured data, build a machine learning model, and score against that model, all using Pentaho tools.

Movie Reviews Data Set:  http://www.cs.cornell.edu/people/pabo/movie-review-data/

Weka's TextDirectoryLoader : It's CLI based and can be used to import a directory of text files into a single ARFF file for Weka's analysis [model building, etc.]. Pretty sleek [fast]!!
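Assuming the Cornell polarity dataset has been unpacked so that txt_sentoken/ contains pos/ and neg/ subfolders (TextDirectoryLoader turns each subfolder name into a class label), the conversion looks roughly like this; adjust the classpath for where your weka.jar lives:

```shell
# Convert a directory of text files into ARFF; each subdirectory name
# (pos, neg) becomes the class label for the files inside it.
java -cp weka.jar weka.core.converters.TextDirectoryLoader \
    -dir txt_sentoken > reviews.arff
```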


Once imported into an ARFF, a single document looks like the one below. [Textual data plus the tagged sentiment]
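For readers without the screenshot, the loader's output is roughly shaped like this (review text abbreviated; the exact relation name depends on your directory):

```
@relation txt_sentoken

@attribute text string
@attribute @@class@@ {neg,pos}

@data
'the movie was a complete waste of time ...',neg
```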

Once in ARFF, you can open the file in Weka "Explorer" to clean punctuation and perform NLP feature engineering [n-grams, stop words, TF, IDF, etc.]. The StringToWordVector class has a bunch of options for NLP feature extraction. You can compute "term frequency" and "inverse document frequency", remove stop words, apply stemming [e.g., collapsing "films" vs "film" into one term], and generate bi-grams, tri-grams, etc.
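This is not Weka code, but a plain-Python sketch (with made-up toy reviews and a tiny illustrative stop-word list) of the quantities StringToWordVector computes: term frequency per document, inverse document frequency across the corpus, and bi-grams.

```python
import math
from collections import Counter

# Toy corpus standing in for the movie reviews
docs = [
    "this film is a great film",
    "boring film with a weak plot",
    "great plot and great acting",
]

STOP_WORDS = {"this", "is", "a", "with", "and"}  # tiny illustrative list


def tokenize(doc):
    """Split on whitespace and drop stop words."""
    return [t for t in doc.split() if t not in STOP_WORDS]


def bigrams(tokens):
    """Adjacent token pairs, joined with a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Term frequency (TF): raw counts per document
tfs = [Counter(tokenize(d)) for d in docs]

# Inverse document frequency (IDF): log(N / docs containing the term)
n_docs = len(docs)
vocab = {t for tf in tfs for t in tf}
idf = {t: math.log(n_docs / sum(1 for tf in tfs if t in tf)) for t in vocab}

# TF-IDF weight of "film" in the first document
w = tfs[0]["film"] * idf["film"]  # 2 * log(3/2) ≈ 0.81
```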

PDI (Kettle) Job:
The job has two transformations to execute. You design them in "Spoon":
1. Knowledge Flow [builds the model; you can choose Support Vector Machine, Decision Tree, Naive Bayes, etc.]
2. Scoring Component [based on the model, supply a test set to find out how well the sentiment classification was done]



Knowledge Flow (To Build Model):
Once you open the Knowledge Flow model-building step, you can supply the path where the knowledge flow file is present. The right side shows the Weka knowledge flow.



Once you open the FilteredClassifier step within the Weka Knowledge Flow, you can see how it uses a Classifier (i.e., the learner) and a Filter (StringToWordVector) together. The last step in the Knowledge Flow is the "Serialized Model Saver", which saves the trained J48 model into a binary file format that will be used in scoring.

PS: Here the source is a text file of movie reviews [unstructured reviews with sentiment tagged]; no ARFF is needed, since the KettleInject step within the Knowledge Flow feeds the data in directly.
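The "filter + learner, then serialize the model" pattern can be sketched in plain Python. This is a conceptual stand-in, not Weka: a made-up keyword classifier plays the role of StringToWordVector + J48, and pickle plays the role of the Serialized Model Saver's binary output.

```python
import pickle


class TinySentimentModel:
    """Toy 'filter + learner' pipeline: a word-presence filter feeding a
    one-rule classifier, standing in for StringToWordVector + J48."""

    POSITIVE = {"great", "excellent", "wonderful"}
    NEGATIVE = {"boring", "awful", "weak"}

    def classify(self, review):
        tokens = set(review.lower().split())          # the "filter" step
        score = len(tokens & self.POSITIVE) - len(tokens & self.NEGATIVE)
        return "pos" if score >= 0 else "neg"         # the "learner" step


model = TinySentimentModel()

# Mimic the "Serialized Model Saver": persist the model in binary form
# (Weka would write these bytes to a .model file on disk)
blob = pickle.dumps(model)

# Later, the scoring transformation reloads the binary model and uses it
loaded = pickle.loads(blob)
print(loaded.classify("a boring film with weak acting"))  # prints "neg"
```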


Scoring:
Now you are ready to supply test data and use the model saved by the "Serialized Model Saver" step previously. The "Model" tab embeds the decision tree in this case.
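To judge how well the sentiment classification was done, you compare predicted labels against actual ones. A minimal Python sketch, using hypothetical (actual, predicted) pairs in place of the scoring step's real output:

```python
from collections import Counter

# Hypothetical scoring output: (actual, predicted) sentiment per review
results = [
    ("pos", "pos"), ("pos", "neg"), ("neg", "neg"),
    ("neg", "neg"), ("pos", "pos"), ("neg", "pos"),
]

confusion = Counter(results)   # (actual, predicted) -> count
accuracy = sum(1 for a, p in results if a == p) / len(results)

print(f"accuracy = {accuracy:.2f}")                       # 4 of 6 correct
print("pos misclassified as neg:", confusion[("pos", "neg")])
```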



Outcome:

Enterprise Need: 
So if you have unstructured data and are looking to run classification on that data set, all you need is Pentaho Data Integration [PDI = Kettle]. This is how you can build a model, score it, classify/predict with it, and take action on it. Most of these tasks can be done via a workflow [a set of jobs]. You can schedule it as needed and, moreover, display the predicted data in a dashboard for analysis.


Hope you enjoyed this approach. Thank you!