How to Run Tesseract OCR / pytesseract in a Lambda Container Image

Takahiro Iwasa (岩佐孝浩)

June 22, 2022

Lambda Python

You can run Tesseract OCR and pytesseract on AWS Lambda by packaging them in a Lambda container image.

Working with Lambda container images - AWS Lambda

The sample for this post is available in the following GitHub repository.

GitHub - iwstkhr/aws-lambda-tesseract-ocr-sample: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM

Prerequisites

Install the following software.

AWS SAM
Python 3.x

Creating the SAM Application

Directory Structure

/
|-- src/
|   |-- Dockerfile
|   |-- __init__.py
|   |-- app.py
|   |-- requirements.txt
|   `-- run-melos.pdf
|-- README.md
|-- __init__.py
|-- requirements.txt
`-- template.yaml

AWS SAM Template

API Gateway limits integration timeouts to 29 seconds, and the sample Python script runs for more than two minutes, so this example uses EventBridge as the Lambda trigger instead.

50 milliseconds - 29 seconds for all integration types, including Lambda, Lambda proxy, HTTP, HTTP proxy, and AWS integrations.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM
Resources:
  TesseractOcrSample:
    Type: AWS::Serverless::Function
    Properties:
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Enabled: true
            Schedule: cron(0 * * * ? *)
      MemorySize: 512
      PackageType: Image
      Timeout: 900
    Metadata:
      DockerTag: latest
      DockerContext: ./src/
      Dockerfile: Dockerfile
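
The schedule in the template uses the six-field cron format that EventBridge expects (minutes, hours, day of month, month, day of week, year). As a minimal sketch, the hypothetical helper below only labels the fields so the expression is easier to read. It is not part of the sample and does no validation beyond counting fields.

```python
import re

# Field order used by EventBridge cron expressions.
FIELDS = ('minutes', 'hours', 'day_of_month', 'month', 'day_of_week', 'year')


def describe_cron(expression: str) -> dict:
    """Label the six fields of an expression such as 'cron(0 * * * ? *)'."""
    match = re.fullmatch(r'cron\((.+)\)', expression.strip())
    if match is None:
        raise ValueError(f'Not a cron(...) expression: {expression!r}')
    values = match.group(1).split()
    if len(values) != len(FIELDS):
        raise ValueError(f'Expected {len(FIELDS)} fields, got {len(values)}')
    return dict(zip(FIELDS, values))


# The expression from the template: minute 0 of every hour.
print(describe_cron('cron(0 * * * ? *)'))
```

Read this way, the expression fires at minute 0 of every hour, every day.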

Dockerfile

Create a Dockerfile with the following contents. The ENV LANG=ja_JP.UTF-8 line is needed when you work with a non-English language such as Japanese; without it, Docker's standard output may be garbled.

FROM public.ecr.aws/lambda/python:3.9

ENV LANG=ja_JP.UTF-8
WORKDIR ${LAMBDA_TASK_ROOT}
COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
    && yum update -y && yum install -y poppler-utils tesseract tesseract-langpack-jpn \
    && pip install -U pip && pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

CMD ["app.lambda_handler"]
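
The image depends on command-line tools installed by yum: pdf2image calls poppler's pdftoppm and pdfinfo, and pytesseract calls the tesseract binary. As a rough sketch (the helper name and the tool list are assumptions, not part of the sample), a check like the following can confirm that they are on PATH inside the container before you debug anything else.

```python
import shutil

# Command-line tools the sample is assumed to need at runtime:
# pdf2image shells out to pdftoppm/pdfinfo, pytesseract to tesseract.
REQUIRED_TOOLS = ('pdftoppm', 'pdfinfo', 'tesseract')


def missing_tools(tools=REQUIRED_TOOLS) -> list:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == '__main__':
    missing = missing_tools()
    print('All tools found.' if not missing else f'Missing: {missing}')
```

An empty list means every tool was found; any name in the list points at a package that did not install.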

Python Script

Create requirements.txt with the following contents, then install the packages with pip install -r requirements.txt.

pdf2image==1.16.0
pytesseract==0.3.9

Create app.py with the following code.

import re
from datetime import datetime

import pdf2image
import pytesseract


def lambda_handler(event: dict, context: object) -> None:
    start = datetime.now()
    result = ''

    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)

    end = datetime.now()
    duration = end.timestamp() - start.timestamp()

    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result: {result}')
    print('----------------------------------------')


def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """ Convert PDF pages to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page to convert (1-based)
        last_page (int): Last page to convert

    Returns:
        list: List of image data
    """

    print(f'Convert a PDF ({pdf_path}) to a png...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'A total of converted png images is {len(images)}.')
    return images


def to_string(image) -> str:
    """ Run OCR on an image.

    Args:
        image: Image data

    Returns:
        str: OCR processed characters
    """

    print('Extract characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """ Normalize result text.

    Applying the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """

    result = target.replace('\n', '')
    # Raw string avoids the invalid "\s" escape; the lookahead keeps the second character in place.
    result = re.sub(r'([あ-んア-ン一-鿐])\s+(?=[あ-んア-ン一-鿐])', r'\1', result)
    return result
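
To see what the normalization step does without running OCR, here is a small standalone sketch. It re-declares normalize with the same character ranges as the sample so that it can run on its own; the input strings are made up for illustration.

```python
import re

# Same character ranges as the sample: hiragana, katakana, CJK ideographs.
JAPANESE = '[あ-んア-ン一-鿐]'


def normalize(target: str) -> str:
    """Remove newlines and whitespace between two Japanese characters."""
    result = target.replace('\n', '')
    return re.sub(rf'({JAPANESE})\s+(?={JAPANESE})', r'\1', result)


print(normalize('走れ メロス\n太宰 治'))  # Japanese characters are joined
print(normalize('PDF file 1'))            # spaces around ASCII are kept
```

One thing to be aware of: the ranges do not include the long-vowel mark ー (U+30FC), so a space that follows it is left in place. Widening the ranges is a possible adjustment if your documents need it.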

Build

Run sam build to build the application.

$ sam build

...

Build Succeeded

Built Artifacts  : .aws-sam/build
Built Template   : .aws-sam/build/template.yaml

Commands you can use next
=========================
[*] Validate SAM template: sam validate
[*] Invoke Function: sam local invoke
[*] Test Function in the Cloud: sam sync --stack-name {stack-name} --watch
[*] Deploy: sam deploy --guided

To run the script locally, use sam local invoke.

$ sam local invoke

...

START RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx Version: $LATEST
Convert a PDF (run-melos.pdf) to a png...
A total of converted png images is 2.
Extract characters from an image...
Extract characters from an image...
----------------------------------------
Start: 2022-06-19 17:37:36.001748
End: 2022-06-19 17:40:18.842054
Duration: 162 seconds
Result:  PDD図書館管理番号 000.000002ー800 走れメロス太宰治=作メロスは激怒した。
...
----------------------------------------
END RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
REPORT RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  Init Duration: 1.09 ms  Duration: 163525.15 ms  Billed Duration: 163526 ms      Memory Size: 512 MB     Max Memory Used: 512 MB
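
As a quick worked check of the numbers in this log, the sketch below recomputes the duration from the Start and End timestamps and compares it with the 29-second API Gateway limit and the 900-second Timeout set in the template. The constant names are made up for illustration.

```python
from datetime import datetime

API_GATEWAY_MAX_SECONDS = 29   # integration timeout quoted earlier
LAMBDA_TIMEOUT_SECONDS = 900   # Timeout value in the SAM template

# Timestamps copied from the log output above.
start = datetime.fromisoformat('2022-06-19 17:37:36.001748')
end = datetime.fromisoformat('2022-06-19 17:40:18.842054')

# Subtracting datetimes yields a timedelta; no need to go through timestamp().
duration = (end - start).total_seconds()

print(f'Duration: {int(duration)} seconds')
print(f'Fits API Gateway limit: {duration <= API_GATEWAY_MAX_SECONDS}')
print(f'Fits Lambda timeout: {duration <= LAMBDA_TIMEOUT_SECONDS}')
```

The run takes roughly 162 seconds for two pages, well past what an API Gateway integration allows but comfortably inside the Lambda timeout.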

Deploy

If you do not have an ECR repository yet, create one with the following command.

aws ecr create-repository --repository-name tesseract-ocr-lambda

Replace the --image-repository value with your own ECR repository, then deploy the application with the following command.

$ sam deploy \
  --stack-name aws-lambda-tesseract-ocr-sample \
  --image-repository 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/tesseract-ocr-lambda \
  --capabilities CAPABILITY_IAM

...

Successfully created/updated stack - aws-lambda-tesseract-ocr-sample in None
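
The --image-repository value used above follows the pattern account ID, region, and repository name. As a small sketch, the hypothetical helper below assembles that string from its parts so you can see which pieces to swap for your own account.

```python
def image_repository_uri(account_id: str, region: str, repository: str) -> str:
    """Build an ECR repository URI of the form used in the deploy command."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError('AWS account IDs are 12 digits')
    return f'{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}'


# Values taken from the example command in this post.
print(image_repository_uri('123456789012', 'ap-northeast-1', 'tesseract-ocr-lambda'))
```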

After deployment, the Lambda function runs every hour and writes the OCR results to CloudWatch Logs.
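
If you would rather pull the OCR results out of CloudWatch Logs programmatically, one possible approach is sketched below. It assumes a boto3 CloudWatch Logs client and the usual /aws/lambda/ log group naming. The function name ocr_results and the placeholder log group are hypothetical, and the client is passed in so the function itself has no AWS dependency.

```python
def ocr_results(logs_client, log_group_name: str) -> list:
    """Collect log lines that start with 'Result:' from a log group."""
    results = []
    # Quoted filter term so the colon is matched literally.
    kwargs = {'logGroupName': log_group_name, 'filterPattern': '"Result:"'}
    while True:
        page = logs_client.filter_log_events(**kwargs)
        results += [
            event['message'].strip()
            for event in page.get('events', [])
            if event['message'].startswith('Result:')
        ]
        token = page.get('nextToken')
        if not token:
            return results
        kwargs['nextToken'] = token  # keep paging until no token is returned


# Example usage (requires boto3 and AWS credentials; replace the placeholder):
# import boto3
# client = boto3.client('logs')
# print(ocr_results(client, '/aws/lambda/<your-function-name>'))
```

The function name that CloudFormation generates for the stack determines the actual log group, so check it in the Lambda console before running this.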

Cleanup

Delete the provisioned AWS resources with the following command.

sam delete --stack-name aws-lambda-tesseract-ocr-sample