Problem: Count the bigrams (pairs of adjacent words) in a text document and list the ten most frequent.
Input: Test_File.txt
I had to put forth more effort as saving to go to go to go to go to go to go to go to go to go to go to go in my retirement . I had to go to school to get ticket to go to movie tomorrow to go to go .
Output:
(('to', 'go'), 15)
(('go', 'to'), 13)
(('I', 'had'), 2)
(('had', 'to'), 2)
(('forth', 'more'), 1)
(('retirement', '.'), 1)
(('to', 'put'), 1)
(('tomorrow', 'to'), 1)
(('to', 'movie'), 1)
(('movie', 'tomorrow'), 1)
Python Code:
from collections import Counter

freq = Counter()
# Raw string so the backslashes in the Windows path are taken literally.
with open(r'C:\Python27\Test_File.txt') as f:
    for line in f:
        words = line.split()
        # Pair each word with its successor on the same line.
        freq.update(zip(words, words[1:]))

# Print the ten most frequent bigrams with their counts.
for item in freq.most_common(10):
    print(item)
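
To check the counts without reading a file, the same zip-based pairing can be wrapped in a function that takes a string. This is only a sketch: the names count_bigrams and sample are made up for illustration and are not part of the original script.

```python
from collections import Counter


def count_bigrams(text):
    """Count adjacent word pairs in a whitespace-separated string.

    Hypothetical helper, not part of the original script.
    """
    words = text.split()
    # zip stops at the shorter input, so the last word has no partner.
    return Counter(zip(words, words[1:]))


# Sample taken from the second sentence of Test_File.txt.
sample = "I had to go to school to get ticket to go to movie tomorrow to go to go ."
freq = count_bigrams(sample)
for item in freq.most_common(2):
    print(item)
```

For this sample, ('to', 'go') occurs 4 times and ('go', 'to') occurs 3 times; every other pair occurs once.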