Thursday, June 08, 2017

Data Science view - Mahout Recommender using Pentaho Integration and Analytics!!

Recommender system design is nothing new, yet many companies still struggle with it in today's competitive e-commerce market. Below is an article that appeared recently in the WSJ.



The chart on the right plots popularity on the Y-axis against items on the X-axis. Physical stores can only stock the highly popular items, but the bottom of the curve still holds the majority of items. Only online stores can supply those, and the few that have built a recommender system are the winners. Examples include Amazon, Netflix, Pandora, e-Harmony, etc.

In this blog post, I used Pentaho Data Integration (PDI), also known as Spoon, the data integration design tool, to call an Apache Mahout-based Collaborative Filtering algorithm. I then parsed the recommended results into more structured data and moved that data into a relational database for analysis in Pentaho Business Analytics.
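To make the idea behind item-based Collaborative Filtering a bit more concrete, here is a minimal Python sketch of cosine similarity between two movies, each represented by the ratings its users gave it. This is only an illustration and not Mahout's actual implementation: the function name `cosine_similarity` and the toy ratings are assumptions made up for this example.

```python
from math import sqrt


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items.

    Each argument is a dict mapping user id -> rating for one item
    (a sparse rating vector). Hypothetical helper, not a Mahout API.
    """
    # Only users who rated both items contribute to the dot product.
    common_users = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in common_users)

    norm_a = sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (made up): ratings per movie, keyed by user id.
movie_a = {"u1": 5, "u2": 3}
movie_b = {"u1": 4, "u2": 2}
movie_c = {"u3": 5}

similar = cosine_similarity(movie_a, movie_b)    # same raters, similar taste
unrelated = cosine_similarity(movie_a, movie_c)  # no raters in common
```

Movies rated in a similar way by the same users score close to 1, while movies with no raters in common score 0. A recommender then suggests the movies most similar to the ones a user already rated highly.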

MovieLens data: https://grouplens.org/datasets/movielens/



Jobs and Transformations:

The job "exec_mahout_similarities" calls the shell script shown, from within "job_recommend_user_ratings" [shown alongside]. Next, the job "job_process_recommended_output" is called. It is parameter-driven, so it can process the output of any similarity measure (Cosine, Pearson Correlation, Spearman Correlation, Euclidean Distance, Tanimoto Coefficient, LogLikelihood). Here, Cosine and LogLikelihood (LLH) are used for the analysis.
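The shell script itself is only shown as an image, so the following is a rough Python sketch of what a parameter-driven call could look like, assuming the script wraps Mahout's `recommenditembased` job. The helper `build_mahout_command`, the measure keys and the paths are hypothetical. Spearman is left out of the mapping because I am not certain of its class name for this job.

```python
# Assumed mapping from a friendly measure name to the value passed to
# Mahout's --similarityClassname option.
SIMILARITY_CLASSNAMES = {
    "cosine": "SIMILARITY_COSINE",
    "pearson": "SIMILARITY_PEARSON_CORRELATION",
    "euclidean": "SIMILARITY_EUCLIDEAN_DISTANCE",
    "tanimoto": "SIMILARITY_TANIMOTO_COEFFICIENT",
    "loglikelihood": "SIMILARITY_LOGLIKELIHOOD",
}


def build_mahout_command(measure, input_path, output_path, num_recommendations=10):
    """Build the argument list for one recommender run (hypothetical helper)."""
    classname = SIMILARITY_CLASSNAMES[measure.lower()]
    return [
        "mahout", "recommenditembased",
        "--input", input_path,
        "--output", output_path,
        "--similarityClassname", classname,
        "--numRecommendations", str(num_recommendations),
    ]


# Placeholder paths, not the ones used in the actual job.
command = build_mahout_command("cosine", "ratings/input", "ratings/output_cosine")
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where the `mahout` launcher is on the PATH. Switching the measure then only means changing one parameter, which is the same idea as the parameter-driven PDI job.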

Once the recommended output is parsed into CSV, the transformation below looks up metadata about each user and movie. In the last phase of the main job, another transformation loads the data into a relational table.
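The parsing step is done in PDI, but the logic can be sketched in a few lines of Python. This assumes the recommender output uses the usual `recommenditembased` layout of one line per user, `userID<TAB>[itemID:score,itemID:score,...]`. The function names and CSV column names are my own choices for illustration.

```python
import csv
import io


def parse_recommendation_line(line):
    """Turn one output line into (user_id, movie_id, score) tuples."""
    user_part, items_part = line.strip().split("\t", 1)
    items_part = items_part.strip().strip("[]")
    rows = []
    if not items_part:
        return rows  # user with no recommendations
    for pair in items_part.split(","):
        item_id, score = pair.split(":")
        rows.append((int(user_part), int(item_id), float(score)))
    return rows


def recommendations_to_csv(lines):
    """Flatten all recommendation lines into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["user_id", "movie_id", "score"])
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerows(parse_recommendation_line(line))
    return buffer.getvalue()


# Made-up sample lines in the assumed format.
sample = ["1\t[10:5.0,20:4.5]", "2\t[30:4.0]"]
csv_text = recommendations_to_csv(sample)
```

Each user line fans out into one row per recommended movie, which is the shape needed for the later lookups against the user and movie metadata.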

Analysis:

Using the Data Source Wizard (DSW), I connected to each table and created an Analysis model on which to build Pentaho Analyzer (PAZ) reports.













Occupation with Movie Count [recommended = 5 rating] - Both algorithms behave similarly, recommending the majority of movies to Students. Doctors and Lawyers receive the fewest.
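The same aggregation can be sketched outside Analyzer. The snippet below is only an illustration on made-up rows: it counts recommendation rows with a predicted rating of 5 per occupation, assuming the parsed data is available as `(user_id, movie_id, score)` tuples plus a user-to-occupation lookup. All names here are hypothetical.

```python
from collections import Counter


def movie_count_by_occupation(recommendations, user_occupation, rating=5.0):
    """Count recommended movies with the given rating, grouped by occupation."""
    counts = Counter()
    for user_id, movie_id, score in recommendations:
        if score == rating:
            counts[user_occupation[user_id]] += 1
    return counts


# Made-up rows, not results from the actual MovieLens run.
recommendations = [
    (1, 10, 5.0),
    (1, 20, 4.5),
    (2, 10, 5.0),
    (3, 30, 5.0),
]
user_occupation = {1: "student", 2: "student", 3: "doctor"}

counts = movie_count_by_occupation(recommendations, user_occupation)
ranked = counts.most_common()  # occupations ordered by movie count
```

Running this once per similarity measure and comparing the two rankings mirrors the side-by-side comparison in the reports.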















Top 10 movies from the recommendations (both similarity measures):

Conclusion: Pentaho was used to build a Mahout-based Collaborative Filtering recommendation data pipeline, along with business analytics that help users such as data scientists analyze the data for enterprise decision making!