How to Run Tesseract OCR with pytesseract in Lambda Container Images


Takahiro Iwasa (岩佐 孝浩)

Developers can run Tesseract OCR with pytesseract using Lambda container images.

You can clone the example code used in this post from my GitHub repository.


Install the following on your computer.

Creating SAM Application

Directory Structure

|-- src/
|   |-- Dockerfile
|   |-- app.py
|   |-- requirements.txt
|   `-- run-melos.pdf
|-- requirements.txt
`-- template.yaml

AWS SAM Template

The example below uses EventBridge as a Lambda trigger because API Gateway has a maximum timeout limit of 29 seconds and the sample Python script runs for more than 2 minutes.

50 milliseconds - 29 seconds for all integration types, including Lambda, Lambda proxy, HTTP, HTTP proxy, and AWS integrations.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM

Resources:
  Function:
    Type: AWS::Serverless::Function
    Properties:
      Events:
        ScheduleEvent:
          Type: Schedule
          Properties:
            Enabled: true
            Schedule: cron(0 * * * ? *)
      MemorySize: 512
      PackageType: Image
      Timeout: 900
    Metadata:
      DockerTag: latest
      DockerContext: ./src/
      Dockerfile: Dockerfile
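For reference, EventBridge Schedule events accept either cron or rate expressions. The following is roughly equivalent to the hourly cron above, except that it fires every hour from the time the rule is created rather than on the hour (the event name here is illustrative):

```yaml
Events:
  ScheduleEvent:
    Type: Schedule
    Properties:
      Schedule: rate(1 hour)
```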


Create Dockerfile with the following content. If you intend to use a local language such as Japanese, add ENV LANG=ja_JP.UTF-8; otherwise you will see garbled text in the Docker standard output.


# Lambda Python base image (adjust the Python version tag as needed)
FROM public.ecr.aws/lambda/python:3.9

COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./
RUN rpm -Uvh \
    && yum update -y && yum install -y poppler-utils tesseract tesseract-langpack-jpn \
    && pip install -U pip && pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

CMD ["app.lambda_handler"]

Python Script

Create requirements.txt with the following content, then install the packages with pip install -r requirements.txt.

pdf2image
pytesseract


Create app.py with the following code.

import re
from datetime import datetime

import pdf2image
import pytesseract

def lambda_handler(event: dict, context: dict) -> None:
    start = datetime.now()
    result = ''

    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)

    end = datetime.now()
    duration = end.timestamp() - start.timestamp()

    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result: {result}')

def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert a PDF to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page starting at 1 to be converted
        last_page (int): Last page to be converted

    Returns:
        list: List of image data
    """
    print(f'Convert a PDF ({pdf_path}) to a png...')
    images = pdf2image.convert_from_path(pdf_path, first_page=first_page, last_page=last_page)
    print(f'A total of converted png images is {len(images)}.')
    return images

def to_string(image) -> str:
    """ OCR an image data.

    Args:
        image: Image data

    Returns:
        str: OCR processed characters
    """
    print('Extract characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')

def normalize(target: str) -> str:
    """ Normalize result text.

    Applying the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """
    result = re.sub('\n', '', target)
    result = re.sub('([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result
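As a quick check of what the two substitutions in normalize do, the snippet below applies the same regular expressions to a few made-up strings (run locally; it does not depend on Lambda, pdf2image, or Tesseract):

```python
import re

def normalize(target: str) -> str:
    # Same rules as above: drop newlines, then drop whitespace
    # that sits between two Japanese characters.
    result = re.sub(r'\n', '', target)
    result = re.sub(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result

print(normalize('走れ\nメロス'))  # → 走れメロス
print(normalize('太宰 治'))       # → 太宰治
print(normalize('Run, Melos'))   # → Run, Melos (spaces between Latin characters are kept)
```

The empty second group with the lookahead is what keeps the following Japanese character in the text while consuming only the whitespace.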


Build by running sam build.

$ sam build


Build Succeeded

Built Artifacts  : .aws-sam/build
Built Template   : .aws-sam/build/template.yaml

Commands you can use next
[*] Validate SAM template: sam validate
[*] Invoke Function: sam local invoke
[*] Test Function in the Cloud: sam sync --stack-name {stack-name} --watch
[*] Deploy: sam deploy --guided

To run this script in your local environment, run sam local invoke.

$ sam local invoke


START RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx Version: $LATEST
Convert a PDF (run-melos.pdf) to a png...
A total of converted png images is 2.
Extract characters from an image...
Extract characters from an image...
Start: 2022-06-19 17:37:36.001748
End: 2022-06-19 17:40:18.842054
Duration: 162 seconds
Result:  PDD図書館管理番号 000.000002ー800 走れメロス太宰治=作メロスは激怒した。
END RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
REPORT RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  Init Duration: 1.09 ms  Duration: 163525.15 ms  Billed Duration: 163526 ms      Memory Size: 512 MB     Max Memory Used: 512 MB


If you do not have an ECR repository, create one with the following command.

aws ecr create-repository --repository-name tesseract-ocr-lambda

Replace --image-repository value with your ECR repository, and deploy the application with the following command.

$ sam deploy \
  --stack-name aws-lambda-tesseract-ocr-sample \
  --image-repository <your-ecr-repository-uri> \
  --capabilities CAPABILITY_IAM


Successfully created/updated stack - aws-lambda-tesseract-ocr-sample in None

After deployment, your Lambda function will run every hour and OCR results will be written to CloudWatch Logs.
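To check those results from the terminal, the stack's logs can be tailed with the SAM CLI (the -n value must be your function's logical ID from template.yaml; Function here is an assumption):

```
sam logs -n Function --stack-name aws-lambda-tesseract-ocr-sample --tail
```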

Cleaning Up

Clean up the provisioned AWS resources with the following command.

sam delete --stack-name aws-lambda-tesseract-ocr-sample
Takahiro Iwasa (岩佐 孝浩)

Software Developer at iret, Inc.
Architecting and developing cloud native applications mainly with AWS. Japan AWS Top Engineers 2020-2023