Wednesday, November 27, 2019

OCR via pytesseract (Capture text from the image)!!


As taken from the site:
Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

Python-tesseract is a wrapper for Google's Tesseract OCR engine: https://github.com/tesseract-ocr/tesseract


Here I will show you how to run it in Google's Colab interface.

Image File (test.png):

In Google's Colab > Open a Python 3 notebook:

#Install pytesseract and the Tesseract OCR engine
!pip install pytesseract
!apt install -y tesseract-ocr
!apt install -y libtesseract-dev
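
Before moving on, it can help to confirm that the install worked. The following is a minimal sketch and not part of the original notebook; the helper name `tesseract_version` is made up for illustration. It uses only the standard library to look up the `tesseract` binary on the PATH and ask it for its version.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if the binary is not on the PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some Tesseract releases print the version banner to stderr rather than stdout,
    # so fall back to stderr when stdout is empty.
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else ""


print(tesseract_version() or "tesseract not found; re-run the install cell")
```

If the cell prints a version line, the engine is installed and reachable from Python.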

#Read the image and extract text
from PIL import Image
import pytesseract

#Mount Google Drive so the notebook can read the image
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
ocr_file = root_dir + 'YOUR_DRIVE/test.png'

#Point pytesseract at the tesseract binary (apt installs it at /usr/bin/tesseract)
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'

img = Image.open(ocr_file)
output = pytesseract.image_to_string(img, lang='eng')
print(output)
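
The same steps can also be wrapped in a small function that fails with a clear message when the image or the engine is missing. This is only a sketch under stated assumptions: the name `ocr_image` is hypothetical, and the sample path simply reuses the placeholder Drive location from the notebook above.

```python
import os
import shutil


def ocr_image(path, lang="eng"):
    """Return the text Tesseract recognizes in the image at `path`."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Image not found: {path}")
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    # Imported here so the checks above run even if these packages are missing.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)


# Placeholder path; point this at your own file on the mounted Drive.
sample = "/content/gdrive/My Drive/YOUR_DRIVE/test.png"
if os.path.isfile(sample):
    print(ocr_image(sample))
```

Checking the path first is useful in Colab, where a wrong Drive folder name is a common cause of errors.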

Outcome:
