Monday, December 22, 2014

Named Entity Recognition via Stanford NER with PHP

Here is a snippet to get named entities for a given sentence using PHP, powered by Stanford NER: http://nlp.stanford.edu/software/CRF-NER.shtml

Step0: Download Stanford NER from the URL above and unzip it in a location of your choice
Step1: Download PHP-Stanford-NLP-master.zip from https://github.com/agentile/PHP-Stanford-NLP
Step2: Unzip PHP-Stanford-NLP-master.zip under ~/Sites/
Step3: Save the code below as test_NER.php under ~/Sites/
Step4: Open a browser and go to localhost/~user/test_NER.php

<?php
# Code to perform Named Entity Recognition using the Stanford Named Entity Recognizer
# Author: Hota S

require './PHP-Stanford-NLP-master/src/StanfordNLP/Base.php';
require './PHP-Stanford-NLP-master/src/StanfordNLP/Exception.php';
require './PHP-Stanford-NLP-master/src/StanfordNLP/Parser.php';
require './PHP-Stanford-NLP-master/src/StanfordNLP/StanfordTagger.php';
require './PHP-Stanford-NLP-master/src/StanfordNLP/NERTagger.php';
require './PHP-Stanford-NLP-master/src/StanfordNLP/POSTagger.php';

$pos = new \StanfordNLP\NERTagger('/Users/XXXX/Downloads/stanford-ner-2014-10-26/classifiers/english.all.3class.distsim.crf.ser.gz','/Users/XXXX/Downloads/stanford-ner-2014-10-26/stanford-ner-3.5.0.jar');

$a_str = "U.S. stocks rose in a generally quiet session, sending the Dow Jones Industrial Average and S&P 500 to fresh record highs, their 35th and 50th of the year, respectively, and setting up the Dow just a stone’s throw from the 18000 mark.Meanwhile, crude oil resumed its slide after a brief respite.WTI crude fell more than 3%, logging in its second-worst close this year.The Federal Reserve Bank of New York led by Timothy R. Geithner.";

// Tag each whitespace-separated token of the sentence
$result = $pos->tag(explode(' ', $a_str));

// Join each (word, tag) pair as word_TAG, separated by spaces
$output = "";
foreach ($result as $token) {
    $output .= $token[0] . "_" . $token[1] . " ";
}
echo trim($output) . "\n";

#var_dump($result);
?>

###########OUTPUT###################

U.S._LOCATION stocks_O rose_O in_O a_O generally_O quiet_O session_O ,_O sending_O the_O Dow_ORGANIZATION Jones_ORGANIZATION Industrial_O Average_O and_O S&P_ORGANIZATION 500_O to_O fresh_O record_O highs_O ,_O their_O 35th_O and_O 50th_O of_O the_O year_O ,_O respectively_O ,_O and_O setting_O up_O the_O Dow_O just_O a_O stone_O 's_O throw_O from_O the_O 18000_O mark.Meanwhile_O ,_O crude_O oil_O resumed_O its_O slide_O after_O a_O brief_O respite.WTI_O crude_O fell_O more_O than_O 3_O %_O ,_O logging_O in_O its_O second-worst_O close_O this_O year.The_O Federal_ORGANIZATION Reserve_ORGANIZATION Bank_ORGANIZATION of_ORGANIZATION New_ORGANIZATION York_ORGANIZATION led_O by_O Timothy_PERSON R._PERSON Geithner_PERSON ._O

Tuesday, November 04, 2014

Connect to RedShift via Python's [psycopg2]

Here is a way to connect to a RedShift DB using the Python library [psycopg2].
psycopg2: A Python-PostgreSQL Database Adapter

#!/usr/bin/python
import psycopg2
import sys
import pprint
from datetime import date, timedelta

# Connect to RedShift
conn_string = "dbname='DBNAME' port='5439' user='USER' password='PWD' host='REDSHIFT_INSTANCE_NAME.redshift.amazonaws.com'"
print "Connecting to database\n        ->%s" % (conn_string)
conn = psycopg2.connect(conn_string)

cursor = conn.cursor()

# Capture the column names: select zero rows and read the names from cursor.description
cursor.execute("Select * from SCHEMA_NAME.TABLE_NAME limit 0;")
column_names = [desc[0] for desc in cursor.description]
all_cols = ', '.join([str(x) for x in column_names])
print all_cols

# Compute yesterday's date; the timedelta argument could instead be taken from the command line as int(sys.argv[1])
yest = date.today() - timedelta(1)
yest_str = yest.strftime('%Y-%m-%d')
print "Yesterday was\n        ->%s" % (yest_str)

conn.commit()
conn.close()
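
If you want to actually use yest_str in a query, a minimal sketch might look like this (place it before conn.close(); the event_date column and the WHERE filter are hypothetical placeholders, not from the original script):

# Hypothetical usage: fetch yesterday's rows (table and date column are placeholders)
cursor.execute("Select * from SCHEMA_NAME.TABLE_NAME where event_date = %s;", (yest_str,))
for row in cursor.fetchall():
    print row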

Monday, September 29, 2014

How to capture all tweets via PHP?

What's the method to get all tweets for a given query?

  • You need to log in to http://dev.twitter.com, create an application, capture the credentials below, and store them in a file [app_tokens.php]
<?php
$consumer_key = '';
$consumer_secret = '';
$user_token = '';
$user_secret = '';
?>
<?php
require 'app_tokens.php';
require 'tmhOAuth-master/tmhOAuth.php';
$query = isset($_GET['query']) ? htmlspecialchars($_GET['query']) : '';
if (empty($query)) {
    $query = "ModiInAmerica";
}
$connection = new tmhOAuth(array(
    'consumer_key' => $consumer_key,
    'consumer_secret' => $consumer_secret,
    'user_token' => $user_token,
    'user_secret' => $user_secret
));
// Get the timeline with the Twitter API
$http_code = $connection->request('GET',
    $connection->url('1.1/search/tweets'),
    array('q' => $query, 'count' => 100, 'lang' => 'en'));
// Request was successful
if ($http_code == 200) {
    // Extract the tweets from the API response
    $response = json_decode($connection->response['response'],true);
    $tweet_data = $response['statuses'];

    // Accumulate tweets from results
    $tweet_stream = '[';
    foreach ($tweet_data as $tweet) {
        // Add this tweet's text to the results
        $tweet_stream .= ' { "tweet": ' . json_encode($tweet['text']) . ' },';
    }
    // Drop the trailing comma left by the loop above
    $tweet_stream = rtrim($tweet_stream, ',');
    $tweet_stream .= ']';
    // Send the tweets back to the Ajax request
    print $tweet_stream;
}
// Handle errors from API request
else {
    if ($http_code == 429) {
        print 'Error: Twitter API rate limit reached';
    }
    else {
        print 'Error: Twitter was not able to process that request';
    }
}
?>

  • Save the second snippet as search.php, then in a browser open localhost/~username/search.php (you can pass a search term with ?query=YOUR_QUERY)

Sunday, September 14, 2014

How fast is AWS RedShift?

Everyone is buzzing BigData, BigData, BigData... but how big is your data really, and how quickly can you load it to get insightful information to your end users?
I am working on a project where the velocity is about 10~15 GB daily in a structured format [~35 million records].

Before being exposed to RedShift@AWS, I had been using Amazon's cloud computing for ETL and was quite happy with the bandwidth and performance. That cloud stack was S3 > EC2 > RDS.

Soon after getting exposed to RedShift@AWS, I found the sparkling-fast way to load and then query this column-storage-based DB just fascinating. It took just 120 minutes to load about 20 days of data at the velocity mentioned above, with no parallel activities running during the load, no sortkeys provided, and no indexes or constraints. This was with just one cluster and the most basic mode of using this column-store DB.
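
I did not show the load commands here, but a minimal sketch of that kind of bulk load, assuming the files are staged in S3 as gzipped, pipe-delimited objects (the bucket, table, and credentials below are placeholders), could look like this via psycopg2:

#!/usr/bin/python
# Hypothetical Redshift bulk load via COPY; names, paths, and credentials are placeholders
import psycopg2

conn = psycopg2.connect("dbname='DBNAME' port='5439' user='USER' password='PWD' "
                        "host='REDSHIFT_INSTANCE_NAME.redshift.amazonaws.com'")
cur = conn.cursor()

# COPY pulls the staged files straight from S3 into the target table
cur.execute("""
    copy SCHEMA_NAME.TABLE_NAME
    from 's3://my-clickstream-bucket/2014-09/'
    credentials 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET'
    delimiter '|' gzip;
""")
conn.commit()
conn.close()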

I am quite interested in mathematically oriented ways of querying such data, so I could freely use set-based operations {the syntax is slightly different from Oracle, but mostly the same}, and could also make use of analytical queries, which run really, really fast.
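
To illustrate the kind of analytical query I mean, here is a small hypothetical window-function example (the table and column names are made up, not from my project), run the same way via psycopg2:

#!/usr/bin/python
# Hypothetical analytical (window function) query; table and column names are placeholders
import psycopg2

conn = psycopg2.connect("dbname='DBNAME' port='5439' user='USER' password='PWD' "
                        "host='REDSHIFT_INSTANCE_NAME.redshift.amazonaws.com'")
cur = conn.cursor()
cur.execute("""
    select user_id,
           event_date,
           sum(page_views) over (partition by user_id
                                 order by event_date
                                 rows unbounded preceding) as running_views
    from SCHEMA_NAME.CLICKSTREAM_DAILY
    order by user_id, event_date
    limit 10;
""")
for row in cur.fetchall():
    print row
conn.close()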

Tuesday, April 29, 2014

How to get S3 files from AWS via Talend ETL?

This post is about capturing files from S3 [AWS] via Talend. There was a requirement to capture clickstream data files that come in various naming patterns. Some files share a common prefix [e.g. Page, PageSummary, PageError].

For each tS3List component:
- Uncheck "List all buckets objects"
- Provide your bucket name under "Bucket name" and a "Key prefix" as needed.
In my case there are several directories under the bucket, so I used "directory_name/File Prefix".

This is how I could distinguish the common-prefix files from the example above:
"directory_name/Page 2014-"
"directory_name/PageSummary"
"directory_name/PageError"


tS3Get:
Bucket: Provide your bucket name
Key: ((String)globalMap.get("tS3List_1_CURRENT_KEY"))   -- note: NO surrounding double quotes
File: "/Users/shota/"+((String)globalMap.get("tS3List_1_CURRENT_KEY"))

PS: All of these components share a central S3 connection object.
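
For reference, the same prefix-based listing and download can also be sketched outside Talend with Python's boto library (the bucket name, prefix, and local directory below are placeholders, not values from my Talend job):

#!/usr/bin/python
# Hypothetical boto (v2) sketch: list S3 keys by prefix and download them locally
import os
import boto

conn = boto.connect_s3()                           # picks up AWS credentials from the environment
bucket = conn.get_bucket('my-clickstream-bucket')  # placeholder bucket name

for key in bucket.list(prefix='directory_name/Page 2014-'):
    local_path = os.path.join('/Users/shota/', key.name)
    local_dir = os.path.dirname(local_path)
    if not os.path.exists(local_dir):
        os.makedirs(local_dir)                     # mirror the S3 "directory" structure locally
    key.get_contents_to_filename(local_path)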

Any question, please provide a comment.

Friday, March 28, 2014

Mahout trunk under Eclipse

Here I tried to capture a few details on how to integrate Mahout with Eclipse on Mac OS.


############Get MAHOUT src for eclipse Integration#########
> cd [place where you got Eclipse workspace]
> svn co http://svn.apache.org/repos/asf/mahout/trunk
> cd trunk
> mvn eclipse:eclipse
####### Set $HADOOP_HOME and $MAHOUT_HOME in .bash_profile in your home directory
echo $HADOOP_HOME
/Users/shota/Downloads/hadoop-1.2.1

echo $MAHOUT_HOME
/Users/shota/Downloads/mahout-distribution-0.9
In my case I am going to OVERWRITE $MAHOUT_HOME in ".bash_profile" to point to the physical location of "trunk". Check to make sure it now points to the correct Mahout home:

echo $MAHOUT_HOME
/Users/shota/Documents/workspace/trunk

> mvn install                  [this can take a while]
:
:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................ SUCCESS [1.526s]
[INFO] Apache Mahout ..................................... SUCCESS [0.340s]
[INFO] Mahout Math ....................................... SUCCESS [54.567s]
[INFO] Mahout Core ....................................... SUCCESS [11:04.491s]
[INFO] Mahout Integration ................................ SUCCESS [1:11.533s]
[INFO] Mahout Examples ................................... SUCCESS [15.172s]
[INFO] Mahout Release Package ............................ SUCCESS [0.011s]
[INFO] Mahout Math/Scala wrappers ........................ SUCCESS [34.278s]
[INFO] Mahout Spark bindings ............................. SUCCESS [2:00.803s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16:03.167s
[INFO] Finished at: Fri Mar 28 09:47:43 EDT 2014
[INFO] Final Memory: 65M/1086M
[INFO] ------------------------------------------------------------------------
################TEST MAHOUT works?
> mahout
:
:
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence

Under Eclipse:
File > Import > Existing Maven Projects > [Browse where you got mahout trunk installed]

After clicking Finish, you need to restart Eclipse. Then you are good to go!






Monday, March 24, 2014

Riding on Mahout for Recommender

How do you use Mahout's recommender? How do you analyze the recommended output?
How does it work on big data sets? Let's take a quick ride on Mahout's recommendation engine.

Mahout has a recommendation algorithm named "recommenditembased" (recommend-item-based) in its ML algorithm stack. I tried it on OS X (Mac) with a fairly large data set of 100k ratings and then another massive set of 1M movie ratings from GroupLens [http://grouplens.org/datasets/movielens/], running the small one first and then the big one.

Input Format (CSV - 3 columns only):
User(#ID), Movie (#ID), Rating (Scale 1-5)

Note: the details of each User and Movie are kept in separate files, with demographics or other extra information.
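
The raw MovieLens u.data file is tab-separated and carries a fourth timestamp column, so it needs a small conversion into the 3-column CSV above (that is what the u.data-noTS.csv file used below is). A minimal sketch of one way to do it, assuming u.data sits in the same ml-100k directory:

#!/usr/bin/python
# Strip the timestamp column from MovieLens u.data (tab-separated: user, item, rating, timestamp)
# and write the 3-column, comma-separated file the Mahout job expects
with open('/Users/shota/Downloads/ml-100k/u.data') as src, \
     open('/Users/shota/Downloads/ml-100k/u.data-noTS.csv', 'w') as dst:
    for line in src:
        user, item, rating, _ts = line.strip().split('\t')
        dst.write('%s,%s,%s\n' % (user, item, rating))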

Once you have downloaded Mahout, you can simply go to the bin directory and run the commands below. For simplicity, I created a shell script [recommend-commands.sh] which cds into Mahout's bin directory and then runs these commands. Comment or uncomment the lines so that only the command you want to run is active.

recommend-commands.sh
cd /Users/shota/Downloads/mahout-distribution-0.9/bin
rm -rf ./temp ./output
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_COSINE
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_LOGLIKELIHOOD
#mahout recommenditembased --input /Users/shota/Downloads/ml-100k/u.data-noTS.csv --output output/ --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE

#Million Ratings
#mahout recommenditembased --input /Users/shota/Downloads/ml-1m/ratings.csv --output output/ --similarityClassname SIMILARITY_COSINE
mahout recommenditembased --input /Users/shota/Downloads/ml-1m/ratings.csv --output output/ --similarityClassname SIMILARITY_LOGLIKELIHOOD

As you can see, similarity can be computed in many ways in Mahout's engine. You can check the output of each of them and then compare which best fits your dataset.

Once this runs, you can go into the "output" directory and grab the file named "part-r-00000". This is the recommender output.






Output:
shota$ head -5 part-r-00000 
1 [1560:5.0,737:5.0,69:5.0,1351:5.0,1194:5.0,481:5.0,474:5.0,68:5.0,449:5.0,25:5.0]
2 [462:5.0,347:5.0,515:5.0,895:5.0,234:5.0,282:5.0,129:5.0,88:5.0,237:5.0,121:5.0]
3 [137:5.0,285:5.0,654:5.0,693:5.0,531:5.0,124:5.0,508:5.0,129:5.0,150:5.0,47:5.0]
4 [282:5.0,121:5.0,895:5.0,234:5.0,275:5.0,690:5.0,1238:5.0,237:5.0,814:5.0,255:5.0]
5 [582:5.0,403:5.0,47:5.0,156:5.0,237:5.0,67:5.0,1016:5.0,608:5.0,128:5.0,276:5.0]

Once you have that, you can map each User_ID [1,2,3..] to its recommended movies, where each entry is a [Movie_ID:Rating] pair. That's quite useful to know.
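
Here is a minimal sketch of that mapping, assuming the ML-100k u.item file (pipe-delimited, with movie id and title as its first two fields) is available and part-r-00000 has been copied to the working directory; the paths and formatting are my own choices, not from the Mahout job:

#!/usr/bin/python
# Map recommender output back to movie titles

# Load movie id -> title from the MovieLens u.item file (pipe-delimited; id and title come first)
titles = {}
with open('/Users/shota/Downloads/ml-100k/u.item') as f:
    for line in f:
        fields = line.split('|')
        titles[fields[0]] = fields[1]

# Each output line is "user_id<TAB>[movie_id:rating,movie_id:rating,...]"
with open('part-r-00000') as f:
    for line in f:
        user, recs = line.strip().split(None, 1)
        movies = []
        for pair in recs.strip('[]').split(','):
            movie_id, rating = pair.split(':')
            movies.append('%s (%s)' % (titles.get(movie_id, movie_id), rating))
        print 'User %s -> %s' % (user, ', '.join(movies))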

Also, when I tried the 1M data set, it was fairly fast.. it took less than a minute to process the results using the MapReduce framework built into Mahout. Kinda cool!

Other Info:
echo $HADOOP_HOME
/Users/shota/Downloads/hadoop-1.2.1

echo $MAHOUT_HOME
/Users/shota/Downloads/mahout-distribution-0.9

[Hadoop 2.2.0 is NOT fully compatible w/ Mahout 0.9 at this moment - Mar 28, 2014]

Related Readings on recommendation:
Short paper: http://infolab.stanford.edu/~ullman/mmds/ch9.pdf
Amazon.com Recommendations  http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf