Wednesday, November 27, 2019

OCR via pytesseract (Capture text from the image)!!


As taken from the site:
Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

This is based on Google's Tesseract: https://github.com/tesseract-ocr/tesseract


Here I will show you how to run it via Google's Colab interface.

Image File (test.png):

In Google's Colab > Open a Python 3 notebook:

#Install the Python library and the Tesseract system packages
!sudo pip install pytesseract
!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev
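
Before running OCR, it can help to confirm that the install above actually put the tesseract binary on the PATH. The following is a minimal sketch using only the standard library; the helper name tesseract_version is made up for this illustration and is not part of pytesseract.

```python
import shutil
import subprocess

def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not found."""
    path = shutil.which(binary)  # looks the binary up on the PATH
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the release, the version text may land on stdout or stderr
    text = result.stdout or result.stderr
    lines = text.splitlines()
    return lines[0] if lines else None

# Prints a version line if Tesseract is installed, otherwise None
print(tesseract_version())
```

If this prints None, the apt install step did not complete and pytesseract will not be able to call the engine.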

#Read Image and extract text
from PIL import Image
import pytesseract

#Mount Google Drive so the notebook can read the image
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
ocr_file = root_dir + 'YOUR_DRIVE/test.png'

#Optional: point pytesseract at the tesseract binary installed by apt
#(typically /usr/bin/tesseract; by default it is found on the PATH)
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

img = Image.open(ocr_file)
output = pytesseract.image_to_string(img, lang='eng')
print(output)

Outcome:

Wednesday, November 13, 2019

Statistical Clustering with PDI on Weka and Scikit Learn engines!!


Here I will show how you can use various clustering models (KMeans or EM) with Weka and Python's scikit-learn machine learning library on data in flight!


Installations: 
- The PMI plugin is installed from the Marketplace.
- Scikit-learn is installed via Anaconda.

PDI job as workflow:
Below is a PDI (Pentaho Data Integration) job which does the following:
- Clean-up step: deletes all existing models
- A transformation that creates Weka-based clustering models
- A transformation that scores the data (i.e. assigns a cluster)
- A transformation that uses Python's scikit-learn package for clustering

This transformation creates a Weka Knowledge Flow (via KettleInject, as seen in the picture) and generates a model file for the clustering algorithm of choice.

The scoring transformation assigns a cluster to each row of the supplied test data.
The Python transformation executes scikit-learn code, as in the picture.
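
To make the two steps concrete (build a clustering model, then score rows against it), here is a dependency-free sketch of KMeans using Lloyd's algorithm. It is not the code from the PDI transformations, which would normally call Weka or scikit-learn's KMeans; the function names kmeans and assign and the sample rows are made up for illustration.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Cluster `points` (tuples of floats) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input rows
    for _ in range(iterations):
        # Assignment step: group every point with its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid)
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids

def assign(point, centroids):
    """Score one row: return the index of the closest centroid."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Made-up rows forming two obvious groups
rows = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centroids = kmeans(rows, k=2)      # "create model" step
labels = [assign(r, centroids) for r in rows]  # "score" step
print(labels)
```

The split mirrors the job above: kmeans plays the role of the model-building transformation, and assign plays the role of the scoring transformation that tags each incoming row with a cluster.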

Conclusion: PDI can work with various machine learning engines, and this case showed clustering algorithms running as part of a data pipeline.

PDI - Pentaho Data Integration (aka Spoon)
PMI - Plugin Machine Intelligence

PyTorch for image classification in Google's Colab notebook...



Among the deep learning libraries, I was exploring PyTorch for image classification!!

Python Notebook: Google's Colab notebooks
You can also enable a GPU via Runtime > Change runtime type, as these classifications may take a long time on larger datasets.

Dataset: the CIFAR-10 dataset, which consists of 60,000 images sized 32 x 32 pixels. The dataset contains 10 mutually exclusive (non-overlapping) classes, with each class containing 6,000 images. The images are small, clearly labelled and have no noise, which makes the dataset ideal for this task with considerably less pre-processing.
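
As a quick sanity check on those figures, the arithmetic can be sketched in a few lines of Python. The image count, size and class count come from the description above; the 3 colour channels (RGB) at one byte each are not stated in the text and are added here as an assumption about how CIFAR-10 is stored.

```python
# Figures from the dataset description; channels=3 (RGB, one byte each)
# is the one value not stated in the text
num_images = 60_000
num_classes = 10
height, width, channels = 32, 32, 3

images_per_class = num_images // num_classes
values_per_image = height * width * channels
raw_bytes = num_images * values_per_image  # one byte per channel value

print(images_per_class)                            # 6000
print(values_per_image)                            # 3072
print(f"{raw_bytes / 1024**2:.1f} MiB uncompressed")  # 175.8 MiB uncompressed
```

At well under 200 MiB of raw pixels, the whole dataset fits comfortably in a Colab session.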




Once the data is imported, you can classify the images as below!




As observed, this uses the torchvision library (https://github.com/pytorch/vision), which provides datasets, model architectures, and common image transformations for computer vision. I also observed that a few of the images are blurry, and the model still classifies them correctly. Explore more with PyTorch; Google's Colab notebook is just slick!!
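
For readers wondering how a network's output turns into a label: a CIFAR-10 classifier emits ten scores per image, and the prediction is the class with the highest score. The sketch below shows that last step in plain Python rather than PyTorch; the helper names and the scores are made up for illustration, and only the ten class names come from the dataset itself.

```python
import math

# CIFAR-10 class names in label order (0-9)
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return (class name, probability) for the highest-scoring class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]

# Hypothetical network output for one image (made-up numbers)
scores = [0.1, -1.2, 0.3, 2.9, 0.0, 1.7, -0.5, 0.2, -2.0, -0.8]
label, prob = predict(scores)
print(label, round(prob, 2))
```

A blurry image typically just produces a flatter set of probabilities; as long as the right class still has the top score, the predicted label stays correct.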