Automating of Partitioning Athena Tables using Partition Projection

Automating of Partitioning Athena Tables using Partition Projection

Takahiro Iwasa
(岩佐 孝浩)
Takahiro Iwasa (岩佐 孝浩)
3 min read
Athena

Athena Partition Projection, introduced in June 2020, enables automatic partitioning of Athena tables. Using this feature, MSCK REPAIR TABLE query is no longer necessary to add new partitions.

Overview

Quoting from the official AWS documentation:

In partition projection, partition values and locations are calculated from configuration rather than read from a repository like the AWS Glue Data Catalog. 1

partition projection can reduce the runtime of queries against highly partitioned tables. 2

Partition projection allows Athena to avoid calling GetPartitions because the partition projection configuration gives Athena all of the necessary information to build the partitions itself. 3

Creating AWS Resources

Create a CloudFormation template with the following content. The key point is GlueTable.TableInput.Parameters of GlueTable definition (lines 27-35).

AWSTemplateFormatVersion: "2010-09-09"
Description: Stack for Athena partition projection sample
Resources:
  S3:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: athena-partition-projection-logs
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256

  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      DatabaseInput:
        Name: sample
      CatalogId: !Ref AWS::AccountId

  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      DatabaseName: !Ref GlueDatabase
      CatalogId: !Ref AWS::AccountId
      TableInput:
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: json
          "projection.enabled": true
          "projection.year_month.format": yyyy/MM
          "projection.year_month.interval": 1
          "projection.year_month.interval.unit": MONTHS
          "projection.year_month.range": 2021/09,NOW
          "projection.year_month.type": date
          "storage.location.template": s3://athena-partition-projection-logs/${year_month}
        StorageDescriptor:
          Columns:
            - Name: id
              Type: int
            - Name: message
              Type: string
          Location: !Sub s3://${S3}/
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Compressed: false
          NumberOfBuckets: 0
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
            Parameters:
              paths: id,message
          StoredAsSubDirectories: false
        PartitionKeys:
          - Name: year_month
            Type: string
        Retention: 0
        Name: sample_logs

Deploy the CloudFormation stack with the following command.

aws cloudformation deploy --template-file stack.yml --stack-name athena-partition-projection-sample

Testing

By running the following command, put objects containing the example JSON below.

$ echo '{"id": 1, "message": "hello"}' > 2021-09.json
$ echo '{"id": 2, "message": "world"}' > 2021-10.json
$ aws s3 cp 2021-09.json s3://athena-partition-projection-logs/2021/09/
$ aws s3 cp 2021-10.json s3://athena-partition-projection-logs/2021/10/
$ aws s3 ls s3://athena-partition-projection-logs/2021/
                           PRE 09/
                           PRE 10/

Let’s query the data in the 2021/09 partition.

SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/09'
LIMIT 10;

Result:

1	hello	2021/09

Next, let’s query the data in the 2021/10 partition.

SELECT * FROM "sample"."sample_logs"
WHERE year_month = '2021/10'
LIMIT 10;

Result:

2	world	2021/10

Cleaning Up

Clean up the provisioned AWS resources with the following command.

aws s3 rm --recursive s3://athena-partition-projection-logs
aws cloudformation delete-stack --stack-name athena-partition-projection-sample

Footnotes

  1. https://docs.aws.amazon.com/athena/latest/ug/partitions.html#partitions-partition-projection

  2. https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html

  3. https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html#partition-projection-pruning-vs-projection

Takahiro Iwasa
(岩佐 孝浩)

Takahiro Iwasa (岩佐 孝浩)

Software Developer at KAKEHASHI Inc.
Involved in the design, development, and operation of the prescription data collection platform