Wednesday, November 27, 2019

OCR via pytesseract (capture text from an image)!!


As described on the project site:
Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

This is based on Google's Tesseract: https://github.com/tesseract-ocr/tesseract


Here I will show you how to run it via Google's Colab interface.

Image File (test.png):

In Google's Colab > Open a Python 3 notebook:

# Install these libraries in the Colab runtime
!sudo pip install pytesseract
!sudo apt install tesseract-ocr
!sudo apt install libtesseract-dev

# Read the image and extract text
from PIL import Image
import pytesseract

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
ocr_file = root_dir + 'YOUR_DRIVE/test.png'

# Point pytesseract at the tesseract binary installed by apt
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

img = Image.open(ocr_file)
output = pytesseract.image_to_string(img, lang='eng')
print(output)

Outcome:

Wednesday, November 13, 2019

Statistical Clustering with PDI on Weka and scikit-learn engines!!


Here, I will show how you can use various clustering models (KMeans or EM) with Weka and Python's scikit-learn machine learning library, for data in flight!


Installations:
- PMI plugin installed from the Marketplace
- scikit-learn installed (e.g., via Anaconda)

PDI job as workflow:
Below is a PDI (Pentaho Data Integration) job which does the following:
- Clean-up step: deletes all existing models
- A transformation that creates Weka-based clustering models
- A transformation that scores (i.e., assigns a cluster to) incoming data
- A transformation that uses Python's scikit-learn package for clustering
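Conceptually, the scikit-learn transformation embeds Python code along these lines. This is a minimal sketch on synthetic data, not the exact script inside the PDI step:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture  # EM-style clustering

rng = np.random.RandomState(42)
# Synthetic rows standing in for the data flowing through the PDI step
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# KMeans clustering
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
kmeans_labels = kmeans.predict(X)

# EM clustering via a Gaussian mixture model
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
gmm_labels = gmm.predict(X)

print(sorted(set(kmeans_labels.tolist())), sorted(set(gmm_labels.tolist())))
```

Either model assigns each incoming row a cluster label, which is what the scoring transformation passes downstream.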


This creates a Weka Knowledge Flow (via KettleInject, as seen in the picture) and generates a model file for the clustering algorithm of choice.


This transformation assigns a cluster to the supplied test data.
The Python transformation executes scikit-learn code, as shown in the picture.


Conclusion: PDI can work with various machine learning engines; this case showed clustering algorithms as part of a data pipeline.

PDI - Pentaho Data Integration (aka Spoon)
PMI - Plugin Machine Intelligence

PyTorch for image classification in Google's Colab notebook...



Among the deep learning libraries, I was exploring PyTorch for image classification!!

Python Notebook: Google's Colab notebooks
You can also enable a GPU via Runtime > Change Runtime Type, as these classifications can take long for larger datasets.

Dataset: the CIFAR-10 dataset consists of 60,000 images sized 32 x 32 pixels. The dataset contains 10 mutually exclusive (non-overlapping) classes, with each class containing 6,000 images. The images are small, clearly labelled, and noise-free, which makes the dataset ideal for this task with considerably less pre-processing.


Once the data is imported, you can classify the images as below!
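The classification step looked roughly like the following. This is a minimal sketch: the tiny CNN is hypothetical and random CIFAR-sized tensors stand in for the real torchvision data loader, so the snippet stays self-contained:

```python
import torch
import torch.nn as nn

# CIFAR-10 class names, in dataset order
classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

# A deliberately tiny CNN for 3x32x32 inputs (hypothetical, for illustration)
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # -> 16x16x16
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # -> 32x8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = SmallNet().to(device)

# A batch of 4 random images standing in for a CIFAR-10 test batch
images = torch.randn(4, 3, 32, 32, device=device)
with torch.no_grad():
    outputs = net(images)                  # shape: (4, 10) class scores
    _, predicted = torch.max(outputs, 1)   # index of the top class per image

print([classes[p] for p in predicted])
```

In the notebook itself, `torchvision.datasets.CIFAR10` with a `DataLoader` supplies the real batches, and the trained network replaces the untrained one here.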


As observed, it uses the torchvision library (https://github.com/pytorch/vision), which provides common image transformations for computer vision. I also observed that a few images are blurry, yet the model still classifies them correctly. Exploring PyTorch with Google's Colab notebook is just slick!!

Saturday, July 06, 2019

Real-time prediction, i.e. Prediction as a Service, using Pentaho at Hitachi!!


How Pentaho is used for Prediction as a Service (real-time prediction from machine learning models)!!

Here I will show how you can use R-based machine learning models from Spoon's PMI (Plugin Machine Intelligence) to generate ML models, and how to access those models in real time from CTools' CDA (Community Data Access).


Data set: "Give Me Some Credit" from Kaggle

Task: predict delinquency in the next 2 years from the given data sets (links below).


Server & Tools:
A running Pentaho Server
Clients:
PDI - Pentaho Data Integration, aka Spoon (for jobs and transformations)
CDA - Community Data Access (comes with Pentaho Server) - executes the transformation in real time, shows the output in the browser, and returns JSON.

PUC Overview:


Here you can double-click get_Score (the green ball) and it opens in the browser. Then you change the attributes as desired and it will predict the first attribute (the class attribute).


Below is a snapshot while changing the values of the fields; you can click on the arrow to execute the "tf_getScore" transformation, which uses a machine learning model (a decision tree) to predict the result in the first column. The data can also be called via a REST API, which returns a JSON payload that can be used for further processing in the pipeline (shown in the pictures below).
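From a client's perspective, calling CDA is a plain HTTP GET against its doQuery endpoint, which returns JSON in a metadata/resultset shape. A minimal sketch - the server URL, CDA file path, dataAccessId, and sample payload below are hypothetical placeholders, and the HTTP fetch is stubbed with a hard-coded response so the snippet stands alone:

```python
import json
from urllib.parse import urlencode

# Hypothetical server and CDA file location
base = "http://localhost:8080/pentaho/plugin/cda/api/doQuery"
params = {
    "path": "/public/getScore.cda",   # hypothetical CDA file
    "dataAccessId": "tf_getScore",
    "outputType": "json",
}
url = base + "?" + urlencode(params)

# In a live setup you would fetch `url` (e.g. with requests.get).
# Here we parse a sample payload in CDA's metadata/resultset JSON shape:
payload = json.loads("""
{"metadata": [{"colName": "PredictedDelinquency"},
              {"colName": "DebtRatio"}],
 "resultset": [["no", 0.35]]}
""")

columns = [m["colName"] for m in payload["metadata"]]
row = dict(zip(columns, payload["resultset"][0]))
print(row["PredictedDelinquency"])
```

The same JSON payload is what a downstream pipeline step would consume instead of the browser view.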


Machine Learning Model generation and Scoring:

You can download PMI (Plugin Machine Intelligence) under Spoon (Pentaho Data Integration) to create models. Here is an example of how to create a model; for evaluation you can use testing methods such as cross-validation (CV) or a separate test set.
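As an analogue to what PMI configures in its GUI, evaluating a decision tree with both testing methods looks like this in scikit-learn. This is a sketch on synthetic data, not the PMI step itself:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
# Synthetic stand-in for the "Give Me Some Credit" features and label
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # delinquent in 2 years: yes/no

clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Testing method 1: 5-fold cross-validation
cv_scores = cross_val_score(clf, X, y, cv=5)

# Testing method 2: a held-out test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
test_score = clf.fit(X_tr, y_tr).score(X_te, y_te)

print(cv_scores.mean(), test_score)
```

PMI reports the equivalent evaluation metrics for its Weka, Python, and R engines from the same step dialog.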

Generate ML Model


Scoring from a generated ML model:


Ultimately, from a batch-oriented perspective, you can also store the predicted results in a database and then create a Mondrian model to visualize them in Pentaho Analyzer.



Code: download and upload to PUC (Pentaho User Console) under "public"
Conclusion:
Pentaho's PMI enables faster deployment of machine learning models through model testing/scoring, feature selection, and parameter tuning. We also saw that you can visualize results or extract them as JSON through real-time access to ML models.