OCR: Extract Text from Image In 8 Easy Steps

Syndication Cloud
Today at 1:25pm UTC
Image: Extract Text from Image (photo from Unsplash)

Originally Posted On: https://medium.com/@pawan329/ocr-extract-text-from-image-in-8-easy-steps-3113a1141c34

In today’s digital world, extracting text from images has become a crucial task in various applications, such as document digitization, text recognition, and data extraction. Python provides several powerful libraries that enable us to perform Optical Character Recognition (OCR) to extract text from images effortlessly. In this article, we will explore the process of extracting text from images using Python, focusing on the popular Tesseract OCR engine.

We’ll use pytesseract to perform this task.
pytesseract (Python-tesseract) is a Python wrapper for Google’s Tesseract-OCR engine: it passes an image to Tesseract and returns the recognized text.

Step 1. Install Tesseract on your machine

Visit https://github.com/UB-Mannheim/tesseract/wiki and download the Tesseract installer for Windows.


After downloading the .exe file, double-click it to start the installation.
Note: Keep the default settings and click ‘Next’ on each screen to complete the installation.

Step 2. Install Tesseract OCR and Required Libraries

With the Tesseract engine installed in Step 1, we now need the Python libraries that call it. Tesseract is an open-source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google.

For Python, we will need the pytesseract wrapper and an image-loading library. Pillow is the usual choice, and pytesseract itself depends on it.
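As a quick check before moving on, the sketch below reports which packages are not yet importable. It assumes the two packages are pytesseract and Pillow (imported as PIL), which are typically installed with pip install pytesseract pillow. The helper name missing_packages is made up for this example.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Assumed requirements for this tutorial: pytesseract and Pillow ("PIL").
missing = missing_packages(["pytesseract", "PIL"])
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required libraries are available.")
```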

Step 3. Import Libraries

Once we have installed the required libraries, let’s import them into our Python script:
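The import listing itself is not reproduced here. A likely version, assuming the libraries are pytesseract and Pillow, might look like the sketch below. The try/except and the OCR_AVAILABLE flag are additions for illustration, so a missing package produces a readable message rather than a traceback.

```python
try:
    import pytesseract          # Python wrapper around the Tesseract engine
    from PIL import Image       # Pillow, used to open image files
    OCR_AVAILABLE = True
except ImportError as exc:
    # One of the packages is missing; record that instead of crashing here.
    OCR_AVAILABLE = False
    print(f"OCR libraries not available: {exc}")
```

In a short script where both packages are known to be installed, the two plain import lines are enough.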

Step 4. Define the tesseract_cmd path.

The tesseract_cmd path might be different in your case. To find the right path, check your Tesseract installation directory.

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
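If you would rather not hard-code the path, one option is to look for the executable first. The sketch below is only an illustration: find_tesseract is a hypothetical helper, and the fallback is assumed to be the installer’s default Windows location.

```python
import os
import shutil

# Default location used by the Windows installer from Step 1 (assumed fallback).
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallbacks=(DEFAULT_WINDOWS_PATH,)):
    """Return the path to the tesseract executable, or None if not found."""
    # shutil.which searches the directories listed in PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try the known install locations in order.
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


path = find_tesseract()
print("Tesseract executable:", path or "not found")
# If a path was found, hand it to pytesseract:
#     pytesseract.pytesseract.tesseract_cmd = path
```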

For developers working in C# or VB.NET

IronOCR simplifies this process significantly. No separate Tesseract installation, no PATH configuration, no tesseract_cmd path issues. Just install the NuGet package and start extracting text.


The accuracy depends on the same factors: image quality, resolution, and font style. IronOCR includes preprocessing methods to help with these issues without needing external libraries.

Step 5. Load the Image

Next, we need to load the image from which we want to extract text. Make sure the image file is in a format supported by Tesseract (e.g., PNG, JPEG, GIF). For this example, let’s assume the image file is named “image.png”:
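A minimal sketch of this step, assuming Pillow is the imaging library. The load_image helper and its missing-file check are illustrative additions, not part of pytesseract.

```python
from pathlib import Path


def load_image(path="image.png"):
    """Open an image file with Pillow and return the Image object."""
    image_path = Path(path)
    if not image_path.is_file():
        raise FileNotFoundError(f"No image found at {image_path}")
    # Pillow is imported here so the missing-file check above runs first,
    # even on a machine where Pillow is not installed yet.
    from PIL import Image
    return Image.open(image_path)
```

With image.png in the working directory, image = load_image() returns a Pillow image ready for OCR.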

Step 6. Perform OCR and Extract Text

Now that we have loaded the image, we can use pytesseract to perform OCR and extract the text from the image. The image_to_string function from pytesseract is used for this purpose:
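The call could be wrapped as in the sketch below. The extract_text wrapper and its "eng" default are assumptions for illustration. image_to_string and its lang parameter are part of pytesseract.

```python
def extract_text(image, lang="eng"):
    """Run Tesseract on a Pillow image (or an image file path) and return a string."""
    # Imported inside the function so the helper can be defined
    # before pytesseract is installed.
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang)
```

For example, extracted_text = extract_text("image.png") should return the recognized text, since image_to_string accepts either a Pillow image or a file path.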

Step 7. Post-Processing (Optional)

Sometimes, the OCR output may contain extra spaces, line breaks, or stray characters that need to be cleaned up. Depending on the specific use case, you might want to perform some post-processing on the extracted text. Here’s an example of removing leading and trailing whitespace:
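The simplest cleanup is text.strip(). The sketch below goes one step further and is only one possible approach: clean_text is a hypothetical helper that also collapses repeated spaces and squeezes runs of blank lines.

```python
import re


def clean_text(raw):
    """Trim OCR output, collapse repeated spaces, and squeeze blank lines."""
    # Collapse runs of spaces/tabs inside each line and trim both line ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove leading and trailing whitespace from the whole result.
    return text.strip()


print(clean_text("  Hello   world \n\n\n\nSecond  line\n"))
```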

Step 8. Display the Extracted Text

For demonstration purposes, we’ll print the extracted text:
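In the simplest case this is a single print call on the extracted string. The sketch below adds a small guard for the case where OCR finds nothing. The format_result helper and the sample string are placeholders, not real OCR output.

```python
def format_result(text):
    """Return the text to display, or a note when OCR produced nothing."""
    if not text.strip():
        return "(no text detected)"
    return "Extracted text:\n" + text


# Placeholder string standing in for real OCR output.
print(format_result("Hello from the image"))
```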

Conclusion

We have explored the process of extracting text from images using Python. We used the pytesseract library, which serves as a Python wrapper for the powerful Tesseract OCR engine. By following the step-by-step guide and the provided Python code, you can easily extract text from images and use it for various applications in your projects.

Remember that the accuracy of the OCR output depends on factors such as image quality, resolution, and font style. Therefore, it’s essential to fine-tune the OCR process according to your specific use case and make any necessary adjustments for optimal results.
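One common adjustment is to pass Tesseract options through the config argument of image_to_string. The sketch below builds such an option string. build_config is a hypothetical helper, and the defaults (page segmentation mode 6, which treats the page as a single uniform block of text, and engine mode 3, Tesseract’s default) are assumptions that may not suit every image.

```python
def build_config(psm=6, oem=3):
    """Build a Tesseract option string for pytesseract's `config` argument."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


print(build_config())
# With an image loaded via Pillow, the call would look like:
#     gray = image.convert("L")   # grayscale can help on noisy scans
#     text = pytesseract.image_to_string(gray, config=build_config(psm=6))
```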