Fetching and transforming data from S3 using AWS Lambda is one of the most common serverless patterns. The new S3 Object Lambda feature allows a Lambda to be invoked on demand as part of the lifecycle of S3 GetObject. This opens up a new set of possibilities. Objects can be transformed, filtered and generated on the fly without putting a higher-level service like API Gateway in front of the bucket. I’ll give a full code example of this below. First, let’s summarise the main points of S3 Object Lambda.
- It works for GetObject, ListObjects and HeadObject. There is no support for intercepting PutObject. (Updated 2023-05-15 to reflect the added support for List and Head)
- Clients use GetObject as normal but replace the bucket name with an S3 Object Lambda Access Point ARN. Signed URLs can of course be used as an abstraction for this (Update: I haven’t yet seen a working implementation of signed URLs for Lambda Access Points)
- The object being requested does not have to exist in the underlying bucket. Every GET request is intercepted by a Lambda invocation and the response can transform a real object or generate new data.
- Chunked and multi-part transformation is supported.
- AWS CLI commands like `aws s3 cp` don’t work with this feature.
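To make the access-point mechanics concrete, here is a minimal sketch of calling GetObject through an S3 Object Lambda Access Point with boto3. The region, account ID and access point name are placeholders, not values from the example repo; the only fixed part is the ARN format and the fact that the ARN replaces the bucket name.

```python
# Sketch: reading through an S3 Object Lambda Access Point with boto3.
# Region, account ID and access point name below are placeholders.


def object_lambda_arn(region: str, account_id: str, name: str) -> str:
    """Build the ARN that replaces the bucket name in GetObject calls."""
    return f"arn:aws:s3-object-lambda:{region}:{account_id}:accesspoint/{name}"


def get_transformed_object(key: str) -> bytes:
    import boto3  # imported lazily so the module loads without AWS dependencies

    s3 = boto3.client("s3")
    arn = object_lambda_arn("eu-west-1", "123456789012", "csv-to-parquet-olap")
    # The ARN goes where the bucket name normally would.
    response = s3.get_object(Bucket=arn, Key=key)
    return response["Body"].read()
```

From the client's point of view, nothing else changes: the same `get_object` call, the same response shape, just a different `Bucket` value.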
For a while, I have wondered if S3 would provide Lambda-in-the-bucket support, allowing for transformation of data close to the source, giving better data locality to these operations. S3 Object Lambda doesn’t seem to be that exactly. Instead, what we get is a standard Lambda environment with a new trigger and response mechanism that is designed to hook into S3 GetObject requests.
Migration on demand
A common use case for object transformation is in data migration. For a set of tabular data stored as CSV, applications may require columns to be added or removed. Some applications may prefer it to be compressed, others uncompressed. As access patterns change, you might prefer to read the data in Parquet format.
Let’s create an S3 Object Lambda that will process requests for CSV or Parquet files. Parquet files will be generated as requested. The logic will be as follows.
- Attempt to read the Parquet file directly and return it if it exists
- If the Parquet file is missing, try to get a CSV version of the same key, replacing the file extension.
- Convert the CSV file to Parquet and return it.
The happy path for this is reasonably straightforward. Step 1 uses the special signed URL provided in the Lambda event. For step 2, we use boto3 to load the CSV object. In both cases, failures can occur so we need to propagate error codes correctly. It would be ideal to account for large files and deal with data in chunks but we’ll keep this example simple.
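The three steps above can be sketched as a handler like the one below. The event fields (`getObjectContext`, `inputS3Url`, `outputRoute`, `outputToken`) and the `WriteGetObjectResponse` call follow the S3 Object Lambda API; the bucket name, key parsing and error handling are simplifications of my own and may differ from the full source in the linked repo.

```python
import posixpath
import urllib.error
import urllib.request
from urllib.parse import urlparse


def csv_key_for(key: str) -> str:
    """Map a requested key (e.g. 'data/sales.parquet') to its '.csv' counterpart."""
    root, _ = posixpath.splitext(key)
    return root + ".csv"


def handler(event, context):
    # boto3 and pandas are imported lazily so the module stays importable
    # outside the Lambda runtime.
    import io
    import os
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    ctx = event["getObjectContext"]
    route, token = ctx["outputRoute"], ctx["outputToken"]

    try:
        # Step 1: try the requested object via the presigned URL in the event.
        with urllib.request.urlopen(ctx["inputS3Url"]) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise  # propagate unexpected errors so the invocation fails loudly
        # Step 2: the Parquet object is missing, so fall back to the CSV
        # version of the same key. This key parsing is a simplification.
        requested_key = urlparse(event["userRequest"]["url"]).path.lstrip("/")
        bucket = os.environ.get("BUCKET_NAME", "my-data-bucket")  # placeholder
        try:
            obj = s3.get_object(Bucket=bucket, Key=csv_key_for(requested_key))
        except s3.exceptions.NoSuchKey:
            s3.write_get_object_response(
                RequestRoute=route, RequestToken=token, StatusCode=404,
                ErrorCode="NoSuchKey", ErrorMessage="No CSV or Parquet for key")
            return
        # Step 3: convert CSV to Parquet in memory with Pandas.
        df = pd.read_csv(io.BytesIO(obj["Body"].read()))
        buf = io.BytesIO()
        df.to_parquet(buf)
        body = buf.getvalue()

    # Return the result to the original GetObject caller.
    s3.write_get_object_response(RequestRoute=route, RequestToken=token, Body=body)
```

Note that the response is never returned from the handler itself; it is streamed back through `WriteGetObjectResponse`, which is what makes the extra IAM permission below necessary.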
The stack for our sample application includes the Lambda function as well as these resources:
- The S3 bucket
- An S3 Bucket Access Point
- An S3 Object Lambda Access Point
The logic of the Lambda function will attempt to fetch the requested key using the provided signed URL. If a Parquet file was requested but not found, we then load the same key with a ‘.csv’ extension before converting to Parquet using Pandas.
The Lambda IAM Role must have the s3-object-lambda:WriteGetObjectResponse permission as well as access to read from the bucket itself in order to load CSV data.
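As a rough guide, the resources and the extra permission might be declared in a SAM template along these lines. The logical names, runtime and handler path are illustrative; the full template in the linked repo may differ.

```yaml
Resources:
  DataBucket:
    Type: AWS::S3::Bucket

  BucketAccessPoint:
    Type: AWS::S3::AccessPoint
    Properties:
      Bucket: !Ref DataBucket

  TransformAccessPoint:
    Type: AWS::S3ObjectLambda::AccessPoint
    Properties:
      ObjectLambdaConfiguration:
        SupportingAccessPoint: !GetAtt BucketAccessPoint.Arn
        TransformationConfigurations:
          - Actions: [GetObject]
            ContentTransformation:
              AwsLambda:
                FunctionArn: !GetAtt TransformFunction.Arn

  TransformFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handler.handler
      Runtime: python3.9
      Policies:
        - S3ReadPolicy:          # read access for the CSV fallback
            BucketName: !Ref DataBucket
        - Statement:             # required to return data to the caller
            - Effect: Allow
              Action: s3-object-lambda:WriteGetObjectResponse
              Resource: "*"
```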
Full source code for the example is available on GitHub: https://github.com/eoinsha/object-lambda-transform.
Testing it Out
By using boto3, we can invoke GetObject against the bucket and against the Lambda Access Point. When using the access point, requests for both CSV and Parquet give us the same tabular data, even though only the CSV exists in the bucket!
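A sketch of that comparison is below. The bucket and access point names are placeholders, and the `rows_from_csv` helper is just a convenience for comparing the two results as plain row dicts.

```python
# Sketch: check that a CSV read from the bucket and a Parquet read through
# the Object Lambda Access Point describe the same table.
import csv
import io


def rows_from_csv(text: str) -> list:
    """Parse CSV text into a list of row dicts for easy comparison."""
    return list(csv.DictReader(io.StringIO(text)))


def compare_reads() -> bool:
    import boto3      # lazy imports: only needed against a real AWS account
    import pandas as pd

    s3 = boto3.client("s3")
    olap = "arn:aws:s3-object-lambda:eu-west-1:123456789012:accesspoint/csv-to-parquet-olap"

    csv_rows = rows_from_csv(
        s3.get_object(Bucket="my-data-bucket", Key="sales.csv")["Body"].read().decode())
    parquet_body = s3.get_object(Bucket=olap, Key="sales.parquet")["Body"].read()
    parquet_rows = pd.read_parquet(io.BytesIO(parquet_body)).astype(str).to_dict("records")
    # Both reads should yield the same rows, even though only the CSV
    # exists in the underlying bucket.
    return csv_rows == parquet_rows
```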
This simple migration example can also be applied to other use cases, like:
- Providing a default response for keys that do not exist
- Filtering data based on part of the key or the user making the request
- Aggregating on demand
- Lazy migration. We could expand our migration example to put the converted data into the bucket so it would not need to be generated for subsequent requests.
Building on Bucket Access Points
S3 Object Lambdas are built on S3 Bucket Access Points, a relatively recent concept that allows for better access control at the resource side. Bucket Access Points have their own policy, allowing you to provide varying resource-level permissions for different use cases. This includes the ability to restrict access to a specific VPC.
By leveraging different access points, access to predefined key prefixes or suffixes can be controlled for one or more S3 Object Lambda functions depending on the user or application retrieving the object.
As with many new features, just because they’re there doesn’t mean you have to use them! We may not need the added complexity this kind of dynamic behaviour brings. I generally prefer to be clear and explicit in software design and avoid hooks that introduce hidden logic. The beauty of S3 is down to its simple-yet-powerful, key-object store design. It’s worth taking a minute to check if there is a simpler, clearer implementation that sticks with that simplicity.
At the same time, there are plenty of real use cases that can take advantage of the ability to generate and transform objects on the fly. S3 Object Lambdas are a really valuable addition to development on AWS.
References and Further Reading
- Example source code — https://github.com/eoinsha/object-lambda-transform
- Introducing Amazon S3 Object Lambda — Use Your Code to Process Data as It Is Being Retrieved from S3, Danilo Poccia — https://aws.amazon.com/blogs/aws/introducing-amazon-s3-object-lambda-use-your-code-to-process-data-as-it-is-being-retrieved-from-s3/
- S3ObjectLambda CloudFormation Reference — https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_S3ObjectLambda.html
- Transforming Objects with AWS Lambda, AWS Documentation — https://docs.aws.amazon.com/AmazonS3/latest/userguide/transforming-objects.html
- Comprehend S3 Object Lambda functions — https://github.com/aws-samples/amazon-comprehend-s3-object-lambdas