Friday, March 28, 2014

Mahout trunk under Eclipse

Here I try to capture a few details on how to integrate the Mahout trunk with Eclipse on Mac OS.


############ Get MAHOUT src for Eclipse integration #########
> cd [place where you keep your Eclipse workspace]
> svn co http://svn.apache.org/repos/asf/mahout/trunk
> cd trunk
> mvn eclipse:eclipse
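A small aside (these are standard maven-eclipse-plugin flags, not something the steps above used): if you also want dependency sources and Javadoc attached inside Eclipse, the plugin can fetch them while generating the project files.

# Optional variant: also pull down sources and Javadoc for dependencies
> mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true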
####### Set $MAHOUT_HOME (and $PATH) in .bash_profile in your home directory
echo $HADOOP_HOME
/Users/shota/Downloads/hadoop-1.2.1

echo $MAHOUT_HOME
/Users/shota/Downloads/mahout-distribution-0.9
In my case I'm going to OVERWRITE $MAHOUT_HOME in .bash_profile so that it points to the physical location of the "trunk" checkout.
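For reference, a minimal sketch of the relevant ~/.bash_profile entries (the paths are just the ones from my machine; adjust them to your own layout):

# ~/.bash_profile (sketch)
export HADOOP_HOME=/Users/shota/Downloads/hadoop-1.2.1
export MAHOUT_HOME=/Users/shota/Documents/workspace/trunk
export PATH=$PATH:$HADOOP_HOME/bin:$MAHOUT_HOME/bin

After editing, run "source ~/.bash_profile" (or open a new terminal) and check that $MAHOUT_HOME now points to the correct Mahout home: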

echo $MAHOUT_HOME
/Users/shota/Documents/workspace/trunk

> mvn install                  [note: this can take a while]
:
:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................ SUCCESS [1.526s]
[INFO] Apache Mahout ..................................... SUCCESS [0.340s]
[INFO] Mahout Math ....................................... SUCCESS [54.567s]
[INFO] Mahout Core ....................................... SUCCESS [11:04.491s]
[INFO] Mahout Integration ................................ SUCCESS [1:11.533s]
[INFO] Mahout Examples ................................... SUCCESS [15.172s]
[INFO] Mahout Release Package ............................ SUCCESS [0.011s]
[INFO] Mahout Math/Scala wrappers ........................ SUCCESS [34.278s]
[INFO] Mahout Spark bindings ............................. SUCCESS [2:00.803s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16:03.167s
[INFO] Finished at: Fri Mar 28 09:47:43 EDT 2014
[INFO] Final Memory: 65M/1086M
[INFO] ------------------------------------------------------------------------
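As an aside (not part of the run above): if you just want the jars and don't need the full test suite, Maven's standard -DskipTests flag cuts the build time down a lot.

# Optional: rebuild without running unit tests (plain Maven flag, nothing Mahout-specific)
> mvn install -DskipTests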
################ TEST that MAHOUT works
> mahout
:
:
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence

Under Eclipse:
File > Import > Maven > Existing Maven Projects > [Browse to where the mahout trunk is checked out]

Once the import finishes, you need to restart Eclipse. Then you are good to go!






Monday, March 24, 2014

Riding on Mahout for Recommender

How do you use Mahout's recommender? How do you analyze the recommended output?
How does it behave on big data sets? Let's take a quick ride on Mahout's recommendation engine.

Mahout has a recommendation algorithm named "recommenditembased" (recommend-item-based) in its ML algorithm stack. I tried it on a Mac (OS X) with a fairly large data set of 100k movie ratings, and then with a much bigger one of 1M ratings, both from GroupLens [http://grouplens.org/datasets/movielens/]. I ran the small set first and then the big one.

Input format (CSV, 3 columns only):
User (#ID), Movie (#ID), Rating (scale 1-5)

(The details of each User and Movie, such as demographics or other metadata, live in separate files.)
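The script below refers to u.data-noTS.csv and ratings.csv, which are not the raw file names GroupLens ships, so presumably a small conversion happened first. Here is a sketch of how I would derive them, assuming the usual MovieLens layouts (tab-separated "user item rating timestamp" in the 100k u.data, and "::"-separated fields in the 1M ratings.dat):

# 100k: drop the timestamp column and switch tabs to commas
cut -f1-3 /Users/shota/Downloads/ml-100k/u.data | tr '\t' ',' > /Users/shota/Downloads/ml-100k/u.data-noTS.csv

# 1M: turn "::" into commas, then keep user, movie, rating
sed 's/::/,/g' /Users/shota/Downloads/ml-1m/ratings.dat | cut -d, -f1-3 > /Users/shota/Downloads/ml-1m/ratings.csv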

Once you have downloaded Mahout, you can simply go to the bin directory and run the commands below. For simplicity, I created a shell script [recommend-commands.sh] which cds into Mahout's bin directory and then runs these commands. Uncomment the ones you want to run (and leave the rest commented out) before running the script.

recommend-commands.sh
#!/bin/bash
cd /Users/shota/Downloads/mahout-distribution-0.9/bin
# clean up output from any previous run
rm -rf ./temp ./output
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_COSINE
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_LOGLIKELIHOOD
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE

#Million Ratings
#mahout recommenditembased --input /Users/shota/Downloads/ml-1m/ratings.csv --output output/ --similarityClassname SIMILARITY_COSINE
mahout recommenditembased --input /Users/shota/Downloads/ml-1m/ratings.csv --output output/ --similarityClassname SIMILARITY_LOGLIKELIHOOD

As you can see, similarity can be computed in several ways by Mahout's engine. You can check the output of each one and then compare which fits your dataset best.
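Besides --similarityClassname, recommenditembased accepts a number of other options (run "mahout recommenditembased --help" to see the full list for your version). A sketch of two I find handy, assuming Mahout 0.9's option names: --numRecommendations controls how many items come back per user (the default is 10, which matches the rows shown below), and --tempDir moves the intermediate MapReduce output out of ./temp.

# Sketch: 20 recommendations per user, intermediates kept under /tmp
mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_LOGLIKELIHOOD --numRecommendations 20 --tempDir /tmp/mahout-temp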

Once this runs, you can go into the "output" directory and look for the file named "part-r-00000". This is the recommender output.
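With my setup the job wrote to the local filesystem, so that file is right there on disk. If your HADOOP_HOME points at an actual cluster and the output lands in HDFS instead, you would pull it down first; a sketch:

# Only needed when output/ lives in HDFS rather than on local disk
hadoop fs -getmerge output/ ./recommendations.txt
head -5 ./recommendations.txt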






Output:
shota$ head -5 part-r-00000 
1 [1560:5.0,737:5.0,69:5.0,1351:5.0,1194:5.0,481:5.0,474:5.0,68:5.0,449:5.0,25:5.0]
2 [462:5.0,347:5.0,515:5.0,895:5.0,234:5.0,282:5.0,129:5.0,88:5.0,237:5.0,121:5.0]
3 [137:5.0,285:5.0,654:5.0,693:5.0,531:5.0,124:5.0,508:5.0,129:5.0,150:5.0,47:5.0]
4 [282:5.0,121:5.0,895:5.0,234:5.0,275:5.0,690:5.0,1238:5.0,237:5.0,814:5.0,255:5.0]
5 [582:5.0,403:5.0,47:5.0,156:5.0,237:5.0,67:5.0,1016:5.0,608:5.0,128:5.0,276:5.0]

Each row maps a User_ID [1, 2, 3, ...] to its recommended movies, listed as [Movie_ID:Rating] pairs. That's quite useful to know.
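Since the output only carries movie IDs, you still need the item file to see actual titles. A rough sketch for the 100k set, assuming u.item is pipe-delimited with the movie id and title as its first two fields (the usual MovieLens 100k layout), and that part-r-00000 rows look like "userID<TAB>[id:score,...]" as above:

# Hypothetical helper: print the recommended titles for one user
USER_ID=1
ITEMS=/Users/shota/Downloads/ml-100k/u.item
awk -v u="$USER_ID" '$1 == u {print $2}' part-r-00000 \
  | tr -d '[]' | tr ',' '\n' | cut -d: -f1 \
  | while read id; do
      awk -F'|' -v id="$id" '$1 == id {print id": "$2}' "$ITEMS"
    done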

Also, when I tried the 1M data set, it was fairly fast; it took less than a minute to process the results using the MapReduce framework built in with it. Kinda cool!

Other Info:
echo $HADOOP_HOME
/Users/shota/Downloads/hadoop-1.2.1

echo $MAHOUT_HOME
/Users/shota/Downloads/mahout-distribution-0.9

[Hadoop 2.2.0 is NOT fully compatible with Mahout 0.9 at this moment - Mar 28, 2014]
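If you run into Hadoop-version trouble, one workaround is to keep Hadoop out of the picture entirely: as far as I understand, the bin/mahout launcher checks the MAHOUT_LOCAL environment variable and, when it is set to anything non-empty, runs the job locally in a single JVM instead of submitting it to Hadoop. A sketch:

# Sketch: force local (non-Hadoop) execution of the same job
export MAHOUT_LOCAL=true
mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_LOGLIKELIHOOD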

Related Readings on recommendation:
Short read: "Recommendation Systems", Ch. 9 of Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds/ch9.pdf
Amazon.com Recommendations (item-to-item collaborative filtering): http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf