How to Run Tesseract OCR with pytesseract in Lambda Container Images

Takahiro Iwasa (岩佐 孝浩)
Lambda Python

Developers can run Tesseract OCR with pytesseract using Lambda container images.

You can pull the example code used in this post from my GitHub repository.

Prerequisites

Install the following on your computer.

Creating SAM Application

Directory Structure

/
|-- src/
|   |-- Dockerfile
|   |-- __init__.py
|   |-- app.py
|   |-- requirements.txt
|   `-- run-melos.pdf
|-- README.md
|-- __init__.py
|-- requirements.txt
`-- template.yaml

AWS SAM Template

The template in this section uses an EventBridge schedule as the Lambda trigger. API Gateway has a maximum integration timeout of 29 seconds, while the sample Python script runs for more than 2 minutes. The API Gateway quota reads:

50 milliseconds - 29 seconds for all integration types, including Lambda, Lambda proxy, HTTP, HTTP proxy, and AWS integrations.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM
Resources:
  TesseractOcrSample:
    Type: AWS::Serverless::Function
    Properties:
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Enabled: true
            Schedule: cron(0 * * * ? *)
      MemorySize: 512
      PackageType: Image
      Timeout: 900
    Metadata:
      DockerTag: latest
      DockerContext: ./src/
      Dockerfile: Dockerfile
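
The trigger choice described above comes down to comparing the expected run time against two timeouts. The following is a minimal sketch under stated assumptions: `pick_trigger` and the labels it returns are hypothetical names for illustration and are not part of the sample repository. The 29-second figure is the API Gateway limit quoted earlier, and 900 seconds is the Timeout value set in template.yaml.

```python
# Hypothetical helper, not part of the sample application.
API_GATEWAY_MAX_TIMEOUT_S = 29  # API Gateway integration timeout quoted earlier
LAMBDA_TIMEOUT_S = 900          # Timeout value set in template.yaml


def pick_trigger(expected_seconds: float) -> str:
    """Suggest a trigger style for a job with the given expected run time."""
    if expected_seconds <= API_GATEWAY_MAX_TIMEOUT_S:
        # Short enough to answer synchronously behind API Gateway.
        return 'api-gateway'
    if expected_seconds <= LAMBDA_TIMEOUT_S:
        # Too slow for API Gateway, but fits in one Lambda invocation.
        return 'schedule'
    # Longer than the function timeout: the work would need to be split.
    return 'split-work'


if __name__ == '__main__':
    # The local run in this post took 162 seconds.
    print(pick_trigger(162))
```

With the 162-second run measured later in this post, the sketch lands on the scheduled (asynchronous) option, which matches the template.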

Dockerfile

Create a Dockerfile with the following content. If you intend to OCR a language other than English, such as Japanese, set a matching locale (for example, ENV LANG=ja_JP.UTF-8); otherwise you may see garbled text in the Docker standard output.

FROM public.ecr.aws/lambda/python:3.9

ENV LANG=ja_JP.UTF-8
WORKDIR ${LAMBDA_TASK_ROOT}
COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
    && yum update -y && yum install -y poppler-utils tesseract tesseract-langpack-jpn \
    && pip install -U pip && pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

CMD ["app.lambda_handler"]
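
Both Python libraries are thin wrappers around command-line tools, so a missing package in the image usually surfaces only at invocation time. The following is a small sketch for checking the image up front. It assumes that pytesseract needs the `tesseract` binary and that pdf2image needs `pdftoppm` and `pdfinfo` from poppler-utils. The helper name `find_missing_binaries` is hypothetical.

```python
import shutil

# Assumed command-line tools behind the Python wrappers:
# - tesseract: called by pytesseract
# - pdftoppm, pdfinfo: called by pdf2image (shipped in poppler-utils)
REQUIRED_BINARIES = ('tesseract', 'pdftoppm', 'pdfinfo')


def find_missing_binaries(names=REQUIRED_BINARIES) -> list:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == '__main__':
    missing = find_missing_binaries()
    if missing:
        print(f'Missing binaries: {", ".join(missing)}')
    else:
        print('All required binaries are available.')
```

Running this inside the built container is a quick way to confirm that the yum install step did what was intended before spending minutes on an OCR run.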

Python Script

Create requirements.txt with the following content, then install the dependencies with pip install -r requirements.txt.

pdf2image==1.16.0
pytesseract==0.3.9

Create app.py with the following code.

import re
from datetime import datetime

import pdf2image
import pytesseract


def lambda_handler(event: dict, context: dict) -> None:
    start = datetime.now()
    result = ''

    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)

    end = datetime.now()
    duration = end.timestamp() - start.timestamp()

    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result: {result}')
    print('----------------------------------------')


def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert PDF pages to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page to convert (1-based)
        last_page (int): Last page to convert (inclusive)

    Returns:
        list: List of image data
    """

    print(f'Convert a PDF ({pdf_path}) to a png...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'A total of converted png images is {len(images)}.')
    return images


def to_string(image) -> str:
    """ Run OCR on an image.

    Args:
        image: Image data

    Returns:
        str: Text recognized by OCR
    """

    print('Extract characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """ Normalize result text.

    Applying the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """

    result = re.sub('\n', '', target)
    # Raw string so that \s reaches the regex engine instead of being read as a string escape.
    result = re.sub(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result
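
To see what normalize does in isolation, here is a standalone copy of the function with a few sample inputs. The sample strings are made up for illustration. The lookahead `(?=...)` does not consume the character after the whitespace, so that character can start the next match. That is what lets a run such as "あ い う" collapse completely.

```python
import re

# Same character classes as in app.py: hiragana, katakana, and CJK ideographs.
JAPANESE_SPACING = re.compile(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))')


def normalize(target: str) -> str:
    """Remove newlines and the whitespace between Japanese characters."""
    result = re.sub('\n', '', target)
    return JAPANESE_SPACING.sub(r'\1\2', result)


if __name__ == '__main__':
    samples = [
        '走れ メロス\n太宰 治',  # spaces and a newline inside Japanese text
        'あ い う',              # consecutive single-character tokens
        'Lambda で OCR',         # spaces next to ASCII text are kept
    ]
    for sample in samples:
        print(repr(normalize(sample)))
```

Note that the ranges あ-ん and ア-ン do not include the long vowel mark ー or the small kana ぁ and ァ, so whitespace next to those characters is left as is.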

Build

Build the container image by running sam build.

$ sam build

...

Build Succeeded

Built Artifacts  : .aws-sam/build
Built Template   : .aws-sam/build/template.yaml

Commands you can use next
=========================
[*] Validate SAM template: sam validate
[*] Invoke Function: sam local invoke
[*] Test Function in the Cloud: sam sync --stack-name {stack-name} --watch
[*] Deploy: sam deploy --guided

To run this script in your local environment, run sam local invoke.

$ sam local invoke

...

START RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx Version: $LATEST
Convert a PDF (run-melos.pdf) to a png...
A total of converted png images is 2.
Extract characters from an image...
Extract characters from an image...
----------------------------------------
Start: 2022-06-19 17:37:36.001748
End: 2022-06-19 17:40:18.842054
Duration: 162 seconds
Result:  PDD図書館管理番号 000.000002ー800 走れメロス太宰治=作メロスは激怒した。
...
----------------------------------------
END RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
REPORT RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  Init Duration: 1.09 ms  Duration: 163525.15 ms  Billed Duration: 163526 ms      Memory Size: 512 MB     Max Memory Used: 512 MB
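
The REPORT line above is worth reading closely: Max Memory Used equals Memory Size (512 MB), which suggests the function may be memory-constrained. The following sketch pulls those numbers out of a REPORT line. The function name `parse_report` and the returned keys are hypothetical, and the pattern only assumes the field labels visible in the output above.

```python
import re

# Field labels as they appear in the REPORT line printed by sam local invoke.
REPORT_PATTERN = re.compile(
    r'Duration: (?P<duration_ms>[\d.]+) ms\s+'
    r'Billed Duration: (?P<billed_ms>\d+) ms\s+'
    r'Memory Size: (?P<memory_mb>\d+) MB\s+'
    r'Max Memory Used: (?P<max_used_mb>\d+) MB'
)


def parse_report(line: str) -> dict:
    """Extract duration and memory figures from a Lambda REPORT line."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        raise ValueError('Not a REPORT line')
    memory_mb = int(match['memory_mb'])
    max_used_mb = int(match['max_used_mb'])
    return {
        'duration_s': float(match['duration_ms']) / 1000,
        'billed_ms': int(match['billed_ms']),
        'memory_mb': memory_mb,
        'max_used_mb': max_used_mb,
        # True when the function used all the memory it was given.
        'memory_saturated': max_used_mb >= memory_mb,
    }


if __name__ == '__main__':
    sample = (
        'REPORT RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  '
        'Init Duration: 1.09 ms  Duration: 163525.15 ms  '
        'Billed Duration: 163526 ms      Memory Size: 512 MB     '
        'Max Memory Used: 512 MB'
    )
    print(parse_report(sample))
```

If memory_saturated is true, raising MemorySize in template.yaml is one thing to try. Lambda allocates CPU in proportion to memory, so a larger setting may also shorten the OCR run.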

Deploy

If you do not have an ECR repository, create one with the following command.

aws ecr create-repository --repository-name tesseract-ocr-lambda

Replace the --image-repository value with your own ECR repository URI, then deploy the application with the following command.

$ sam deploy \
  --stack-name aws-lambda-tesseract-ocr-sample \
  --image-repository 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/tesseract-ocr-lambda \
  --capabilities CAPABILITY_IAM

...

Successfully created/updated stack - aws-lambda-tesseract-ocr-sample in None

After deployment, the Lambda function runs at the top of every hour, as defined by cron(0 * * * ? *) in the template, and the OCR results are written to CloudWatch Logs.
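
To read the results back without opening the console, the log events can be filtered for the "Result:" line that app.py prints. The following is a sketch under assumptions: `fetch_messages` is a hypothetical helper, the log group name has to be looked up for your deployed function, and the client passed in is expected to behave like the boto3 CloudWatch Logs client (its filter_log_events call with logGroupName, filterPattern, and nextToken).

```python
def fetch_messages(logs_client, log_group_name: str,
                   filter_pattern: str = '"Result:"') -> list:
    """Collect log messages matching the filter pattern, following pagination.

    logs_client is assumed to expose filter_log_events like the boto3
    CloudWatch Logs client does.
    """
    messages = []
    kwargs = {'logGroupName': log_group_name, 'filterPattern': filter_pattern}
    while True:
        response = logs_client.filter_log_events(**kwargs)
        messages.extend(event['message'] for event in response.get('events', []))
        token = response.get('nextToken')
        if not token:
            return messages
        # Ask for the next page on the following iteration.
        kwargs['nextToken'] = token
```

A typical call would pass boto3.client('logs') and the log group of the deployed function, which for Lambda follows the form /aws/lambda/ followed by the function name. Passing the client in as a parameter keeps the helper easy to try out with a stub.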

Cleaning Up

Clean up the provisioned AWS resources with the following command.

sam delete --stack-name aws-lambda-tesseract-ocr-sample

Takahiro Iwasa (岩佐 孝浩)

Software Developer at iret, Inc.
Architecting and developing cloud native applications mainly with AWS. Japan AWS Top Engineers 2020-2023