Thursday, October 18, 2012

Simple bi-gram finder (top 10 by frequency) in Python

Problem: extract the bi-grams (pairs of adjacent words) from a text document and report the ten most frequent.
Input: Test_File.txt


I had to put forth more effort as saving to go to go to go  to go  to go  to go  to go  to go  to go  to go  to go in my retirement . I had to go to school to get ticket to go to movie tomorrow  to go  to go .


Output:

(('to', 'go'), 15)
(('go', 'to'), 13)
(('I', 'had'), 2)
(('had', 'to'), 2)
(('forth', 'more'), 1)
(('retirement', '.'), 1)
(('to', 'put'), 1)
(('tomorrow', 'to'), 1)
(('to', 'movie'), 1)
(('movie', 'tomorrow'), 1)

Python Code:
from collections import Counter

# Collect all the words in the file, in order
f = open(r'C:\Python27\Test_File.txt')
words = []
for line in f:
    words.extend(line.split())
f.close()

# Zip the word list against a second iterator advanced by one position,
# so each word is paired with the word that follows it
nextword = iter(words)
next(nextword)
freq = Counter(zip(words, nextword))

for item in freq.most_common(10):
    print item
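The same pairing trick reads even more directly in Python 3 if you zip the word list against itself shifted by one; a minimal sketch of the technique (the helper name top_bigrams is mine, not from the script above):

```python
from collections import Counter

def top_bigrams(text, n=10):
    """Count adjacent word pairs and return the n most frequent."""
    words = text.split()
    # words[1:] is the list shifted by one, so zip yields each
    # word together with the word that follows it
    return Counter(zip(words, words[1:])).most_common(n)

pairs = top_bigrams("I had to go to school to get ticket to go to movie")
```

Because zip stops at the shorter sequence, the last word is never paired with anything, which is exactly what you want for bi-grams.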

Sunday, October 14, 2012

Python for Feature Vector Generation

In this post I share two Python scripts that make a good starting point for any NLP-based text-mining task: generating feature vectors over the 500 most frequent words.

Most frequent 500 BoWs (500_freq_words.py):

from string import punctuation
from operator import itemgetter

N = 500
words = {}

# Strip punctuation, lower-case, and count every word in the file
words_gen = (word.strip(punctuation).lower()
             for line in open(r'C:\Python27\All_Cleaned_Comment_Loyality_Change.txt')
             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

# Keep only the N most frequent words
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s %d" % (word, frequency)
-------------------------------------------------------------------------
Once you have generated the 500 most frequent BoWs from a dataset, store them in a file using the shell redirection operator (e.g. python 500_freq_words.py > Top_500_words.txt). That file then drives the vector generation below.
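If you would rather skip the shell step, the top-word file can be written directly from Python as well. A Python 3 sketch of that idea (the function name write_top_words is mine, not part of the scripts here):

```python
from collections import Counter
from string import punctuation

def write_top_words(in_path, out_path, n=500):
    """Count cleaned words in in_path and write the n most frequent,
    one per line, to out_path -- the layout the vector step reads."""
    with open(in_path) as f:
        counts = Counter(word.strip(punctuation).lower()
                         for line in f
                         for word in line.split())
    with open(out_path, "w") as out:
        for word, _ in counts.most_common(n):
            out.write(word + "\n")
```

This keeps the whole pipeline inside Python, which is handy on Windows where redirection behaves differently across shells.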

Vector Generation:

from string import punctuation
import sys

# Read the top-500 word list (one word per line)
f_500 = open(r'C:\Python27\Top_500_words.txt')
top_500 = [line.strip() for line in f_500]
f_500.close()

for arg in sys.argv[1:]:            # argv[0] is the script name, so skip it
    words = {}
    total_words = 0

    # Count the cleaned words in this document
    for line in open(arg):
        for word in line.split():
            total_words += 1
            word = word.strip(punctuation).lower()
            words[word] = words.get(word, 0) + 1

    # One feature per top-500 word: frequency per 1000 words,
    # with a floor of 1/total_words for words the document lacks
    for word in top_500:
        frequency = words.get(word, 1)
        print ', %f' % ((frequency / float(total_words)) * 1000),

    print ', ' + arg
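For reference, the per-document feature computation above boils down to a few lines. A Python 3 sketch (the function name feature_vector is mine): each feature is the word's frequency per 1,000 words, with 1/total as a floor for words absent from the document.

```python
from collections import Counter
from string import punctuation

def feature_vector(text, top_words):
    """Frequency per 1,000 words for each top word, floored at 1/total."""
    counts = Counter(w.strip(punctuation).lower() for w in text.split())
    total = float(sum(counts.values()))
    # Absent words fall back to a count of 1 rather than 0, so no
    # feature is ever exactly zero (a crude smoothing choice)
    return [(counts.get(w, 1) / total) * 1000 for w in top_words]
```

Scaling by 1,000 just keeps the numbers in a readable range; most downstream classifiers are insensitive to a constant scale factor.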