We thought our AWS Lambda S3 trigger Python pipeline was bulletproof — clean invocations, no errors in CloudWatch, records flowing into DynamoDB. Then we audited the data and found duplicates, missing rows, and an AWS bill that had quietly doubled over a weekend burst upload. This is the honest post-mortem.
The pipeline was real production work: customers uploaded CSV files to S3, Lambda parsed each file and wrote rows to DynamoDB, and downstream systems read from that table. No EC2, no containers, no infra to babysit. That was the appeal. What we underestimated was how many invisible failure modes come with serverless when you skip the hardening step.
Context: Why We Chose Lambda + S3 for This Pipeline

The use case was straightforward. A business process generated CSV exports throughout the day — anywhere from a handful of files to several hundred during peak hours. Each file contained between 50 and 5,000 rows. We needed to ingest those rows into DynamoDB with low latency and no dedicated worker process sitting idle overnight.
Lambda with an S3 ObjectCreated trigger was the obvious fit. Pay per invocation, no idle cost, automatic scaling. We wrote the handler in Python 3.11 and used aws-lambda-powertools==2.38.0 for logging. We deployed it in a day. It worked in staging. We shipped it.
The mistake was treating “it works in staging” as equivalent to “it’s production-hardened.” Staging never sent 800 files simultaneously. Staging never had eventual-consistency edge cases. And nobody in staging was checking whether the data in DynamoDB was actually correct — just that the function returned 200.
What followed were three distinct failure modes, each one teaching us something we should have known before go-live. I’m writing this so you don’t have to learn them the same way we did.
Mistake 1: Assuming S3 Event Notifications Are Exactly-Once
This one hurt the most because the data corruption was silent. S3 event notifications are at-least-once delivery. That’s documented in the AWS S3 event notification docs, but it’s easy to skim past when you’re focused on getting the trigger wired up. Under retry conditions or eventual-consistency edge cases, S3 can fire the same event notification more than once. Lambda will execute twice for the same object. If your handler isn’t idempotent, you write the same rows twice.
We had no deduplication logic at all. The first version of our handler just iterated the CSV and called table.put_item() for every row. DynamoDB’s default put_item behavior is an upsert — it overwrites silently. So duplicate invocations didn’t raise errors. They just overwrote with identical data, and our row counts looked plausible enough that nobody noticed for two weeks.
The fix is a conditional write using a composite idempotency key built from the S3 object’s ETag and the row’s own identifier. The ETag changes when the object changes, so a re-uploaded file with new data gets processed correctly. A duplicate event for the same file gets blocked at the DynamoDB layer with a ConditionalCheckFailedException — which you catch, log as a skip, and do not re-raise.
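To see why the conditional write makes a replayed event harmless, here is a minimal in-memory sketch. FakeTable and DuplicateKey are hypothetical stand-ins for a DynamoDB table and its ConditionalCheckFailedException, not boto3 APIs; they only mimic the attribute_not_exists(pk) condition so the behavior can be checked without AWS.

```python
class DuplicateKey(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class FakeTable:
    """Hypothetical in-memory table that mimics attribute_not_exists(pk)."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, pk, item):
        if pk in self.items:
            raise DuplicateKey(pk)
        self.items[pk] = item


def ingest(table, etag, rows):
    """Write rows keyed by ETag + row id; return (written, skipped)."""
    written = skipped = 0
    for row in rows:
        pk = f"{etag}#{row['id']}"
        try:
            table.put_if_absent(pk, row)
            written += 1
        except DuplicateKey:
            skipped += 1  # replayed event: already stored, safe to ignore
    return written, skipped


table = FakeTable()
rows = [{"id": "1", "amount": "10"}, {"id": "2", "amount": "20"}]
first = ingest(table, "abc123", rows)     # original delivery
replay = ingest(table, "abc123", rows)    # duplicate delivery of the same event
reupload = ingest(table, "def456", rows)  # same rows, new object version (new ETag)
```

The replay writes nothing and counts two skips, while the re-upload under a new ETag is stored as new items.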
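Watch out for this: S3 API responses such as get_object and head_object return the ETag wrapped in double quotes ("\"abc123\""), while the event payload normally carries the bare value. Call .strip('"') wherever you read an ETag so that both sources yield the same key; if one side keeps its quotes, your idempotency check will never match.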
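The key decoding and ETag normalization are easy to pull into a small helper that can be tested without AWS. This is a sketch: object_identity is a hypothetical name, and the sample record is made up and contains only the fields the helper reads.

```python
import urllib.parse


def object_identity(record):
    """Return (bucket, key, etag) from one S3 event record, normalized."""
    s3 = record["s3"]
    bucket = s3["bucket"]["name"]
    # Keys arrive URL-encoded: spaces become '+', other characters become %XX
    key = urllib.parse.unquote_plus(s3["object"]["key"])
    # Stripping is a no-op on a bare ETag, so quoted and unquoted values match
    etag = s3["object"]["eTag"].strip('"')
    return bucket, key, etag


# Made-up record containing only the fields the helper reads
sample = {
    "s3": {
        "bucket": {"name": "example-uploads"},
        "object": {"key": "exports/daily+report%282%29.csv", "eTag": '"abc123"'},
    }
}
bucket, key, etag = object_identity(sample)
```

Because the strip is harmless on an unquoted value, the same helper works whether the ETag came from an event record or from an API response.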
Also watch out for the event type. We originally used s3:ObjectCreated:Put in our trigger config. That misses multipart uploads, which emit s3:ObjectCreated:CompleteMultipartUpload as a separate event type. The AWS CLI and the SDKs’ high-level transfer helpers switch to multipart automatically once a file passes a size threshold, so large files arrive this way by default. Use s3:ObjectCreated:* to catch both, along with Post and Copy.
Here is the handler we ended up with — idempotency guard included:
# lambda_handler.py
# Python 3.11 | aws-lambda-powertools==2.38.0
# S3 → Lambda trigger: processes uploaded CSV, writes rows to DynamoDB
# Demonstrates: idempotency via ETag, structured logging, proper error propagation
import csv
import json
import os
import urllib.parse

import boto3
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.typing import LambdaContext
from botocore.exceptions import ClientError

logger = Logger(service="s3-csv-processor")  # structured JSON logs to CloudWatch
s3_client = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
# TABLE_NAME is set by the Terraform config; the literal is only a fallback
TABLE_NAME = os.environ.get("TABLE_NAME", "processed-records")
table = dynamodb.Table(TABLE_NAME)


@logger.inject_lambda_context(log_event=True)  # also logs the full incoming event
def handler(event: dict, context: LambdaContext) -> dict:
    """
    Triggered by S3 ObjectCreated events.
    Reads CSV from S3, writes each row to DynamoDB with idempotency guard.
    Raises exception on unrecoverable errors so that Lambda retries fire.
    """
    records_processed = 0
    records_skipped = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 keys with spaces are URL-encoded in the event payload
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        etag = record["s3"]["object"]["eTag"].strip('"')  # drop surrounding quotes if present
        logger.info("Processing object", extra={"bucket": bucket, "key": key, "etag": etag})
        try:
            response = s3_client.get_object(Bucket=bucket, Key=key)
            body = response["Body"].read().decode("utf-8").splitlines()
        except ClientError as e:
            # Unrecoverable (object missing or permissions error): raise to trigger retry/DLQ
            logger.error("Failed to fetch S3 object", extra={"error": str(e)})
            raise  # DO NOT swallow; Lambda must see this as a failure
        reader = csv.DictReader(body)
        for row in reader:
            pk = f"{etag}#{row.get('id', '')}"  # composite idempotency key
            try:
                table.put_item(
                    Item={"pk": pk, "etag": etag, "data": json.dumps(row)},
                    ConditionExpression="attribute_not_exists(pk)",  # idempotency guard
                )
                records_processed += 1
            except ClientError as e:
                if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                    # Duplicate: already processed this row, safe to skip
                    logger.warning("Duplicate row skipped", extra={"pk": pk})
                    records_skipped += 1
                else:
                    # Unexpected DynamoDB error: raise to trigger retry
                    logger.error("DynamoDB write failed", extra={"error": str(e)})
                    raise
    logger.info(
        "Processing complete",
        extra={"records_processed": records_processed, "records_skipped": records_skipped},
    )
    return {"statusCode": 200, "processed": records_processed, "skipped": records_skipped}
Mistake 2: Swallowing Exceptions and Trusting Lambda’s Default Retry Behavior
This one is embarrassing to admit. We wrapped the entire handler body in a bare except Exception: pass block. The reasoning at the time was that we didn’t want noisy alerts firing for transient S3 errors. The result was that Lambda reported every invocation as a success, retries never fired, and rows from files that had actual processing errors simply vanished.
Lambda async invocations — which is what an S3 trigger uses — will retry up to two times on failure. But “failure” means the function raised an unhandled exception or Lambda’s runtime received an error response. If you swallow the exception and return normally, Lambda sees a success. It will not retry. It will not route to your Dead Letter Queue. The event is gone.
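The difference is easier to see in a toy model. The invoke_async function below is a hypothetical simulation of the async retry rule described above (one attempt, up to two retries, then the dead-letter queue); it is not an AWS API, and the two handlers are illustrative.

```python
def swallowing_handler(event):
    try:
        raise ValueError("malformed CSV")
    except Exception:
        pass  # error disappears; the caller sees a normal return
    return {"statusCode": 200}


def raising_handler(event):
    try:
        raise ValueError("malformed CSV")
    except Exception:
        # log with context here, then let the failure propagate
        raise


def invoke_async(handler, event, max_retries=2):
    """Simulate an async invocation; return (attempts, outcome, dead_letters)."""
    dead_letters = []
    attempts = 0
    for attempts in range(1, max_retries + 2):
        try:
            handler(event)
            return attempts, "success", dead_letters
        except Exception:
            continue  # failure is visible, so the next attempt runs
    dead_letters.append(event)  # retries exhausted: event is kept for replay
    return attempts, "failed", dead_letters


event = {"key": "a.csv"}
swallowed = invoke_async(swallowing_handler, event)
raised = invoke_async(raising_handler, event)
```

The swallowing handler is reported as a success after one attempt and leaves nothing to replay. The raising handler uses all three attempts and its event ends up in the dead-letter list.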
The tell in CloudWatch is subtle: you see "status": "success" entries with a billed_duration under 100ms for files that should have taken several seconds to process. That’s the signature of a function that hit an error early, swallowed it, and returned immediately. We had dozens of these in our logs and didn’t notice for weeks because we were only looking at the error rate metric — which was zero.
Watch out for this: a zero error rate in Lambda metrics does not mean your function is working correctly. It means it isn’t raising unhandled exceptions. Those are very different things.
The fix has three parts. First, always re-raise after logging. If you catch an exception to add context to the log message, call raise at the end of the except block — don’t return a 200. Second, configure a Dead Letter Queue pointing to SQS so that events which exhaust retries land somewhere replayable rather than disappearing. Third, set MaximumRetryAttempts explicitly in your Terraform config. The default is 2, which is fine, but having it documented in code means the next engineer doesn’t have to guess.
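Replaying from the queue means recovering the original S3 event from each message. The sketch below assumes two message shapes: a dead-letter queue message whose body is the original event, and an on-failure destination record that wraps the event under requestPayload. Check both against real messages in your own queue before relying on it; the function name and the sample bodies are illustrative.

```python
import json


def s3_objects_from_failure_message(body):
    """Return (bucket, key) pairs from a failure-queue message body."""
    message = json.loads(body)
    # Destination records wrap the original event; DLQ messages are the event
    event = message.get("requestPayload", message)
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


# Made-up message bodies covering both assumed shapes
original_event = {
    "Records": [{"s3": {"bucket": {"name": "example-uploads"}, "object": {"key": "a.csv"}}}]
}
dlq_body = json.dumps(original_event)
destination_body = json.dumps({"requestPayload": original_event})

from_dlq = s3_objects_from_failure_message(dlq_body)
from_destination = s3_objects_from_failure_message(destination_body)
```

The keys returned here are still URL-encoded, exactly as they were in the event, so a replay script should decode them the same way the handler does.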
Also: do not make the mistake of setting your S3 trigger on both your source bucket and your destination bucket. If Lambda writes output back to S3 and that bucket also has an ObjectCreated trigger pointing at the same function, you will create an infinite invocation loop. We saw this in a test environment and watched the invocation count climb to several thousand before we killed the function. Reserved concurrency set to zero stops the bleeding — but note that setting reserved concurrency to zero also fully disables the function, so use it as an emergency brake only.
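If a single bucket ever has to hold both input and output, one cheap extra guard is to have the handler ignore anything outside the input prefix. This is a sketch under that assumption: the prefixes and the should_process name are hypothetical, and the notification filter should carry the same restriction so the function is not invoked for its own output in the first place.

```python
INPUT_PREFIX = "incoming/"    # hypothetical: where customers upload
OUTPUT_PREFIX = "processed/"  # hypothetical: where the function writes


def should_process(key):
    """True only for CSV uploads under the input prefix."""
    if key.startswith(OUTPUT_PREFIX):
        return False  # our own output: processing it would start a loop
    return key.startswith(INPUT_PREFIX) and key.lower().endswith(".csv")


decisions = {
    key: should_process(key)
    for key in ["incoming/a.csv", "processed/a.csv", "incoming/readme.txt"]
}
```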
Mistake 3: Ignoring Concurrency Limits and the Cold Start Tax on Large Files
We left the Lambda timeout at 3 seconds. That is the console default. We never changed it because the function worked fine in staging, where files were small and uploads were infrequent. In production, during a burst of 800 simultaneous uploads, two things happened at once: large files started timing out at exactly 3 seconds, and the retry logic we’d now properly configured started multiplying the invocation count.
Each timeout triggered a retry. Each retry timed out again. We went from 800 invocations to over 2,400 in a few minutes. The account-level default concurrency limit is 1,000 across all functions in a region. We consumed most of it, which started throttling unrelated Lambda functions in the same account. Other teams noticed their functions were being throttled before we even knew we had a problem.
The cost side of this is also worth understanding. Lambda charges per GB-second. A 128 MB function that times out after 3 seconds costs the same as a 512 MB function that completes in 0.75 seconds — but the 512 MB version actually finishes the work, because memory allocation scales CPU proportionally. Running at 128 MB to “save money” on a parsing workload often costs more in wall-clock time and produces worse throughput. We landed on 512 MB as our minimum for any function doing I/O plus CSV parsing, and our p99 processing time dropped significantly.
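The arithmetic behind that comparison is short enough to check directly. This sketch covers only the compute portion measured in GB-seconds; it leaves out the per-request charge and the regional price per GB-second, and gb_seconds is a hypothetical helper name.

```python
def gb_seconds(memory_mb, duration_s):
    """Billable compute for one invocation, in GB-seconds."""
    return (memory_mb / 1024) * duration_s


small_timeout = gb_seconds(128, 3.0)   # hits the 3 s timeout, work not done
large_finish = gb_seconds(512, 0.75)   # completes the file
```

Both come out to 0.375 GB-seconds, but only the second invocation produces rows in the table, and the first one will be retried and billed again.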
We set the timeout to 30 seconds after measuring actual p99 processing time on our largest files. We set reserved concurrency to 50, which is enough for our burst patterns and prevents the function from consuming the account-level pool. These are not guesses — they came from CloudWatch metrics on duration and concurrent executions over a two-week observation period before we locked the values in Terraform.
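One way to turn measured durations into a timeout is to take the p99 and multiply by a headroom factor. The sketch below illustrates that idea only; it is not the exact procedure we followed, and the headroom factor, the function name, and the sample durations are all assumptions.

```python
import math


def suggest_timeout(durations_s, headroom=2.0):
    """Return a timeout in whole seconds: p99 duration times a headroom factor."""
    if not durations_s:
        raise ValueError("need at least one duration sample")
    ordered = sorted(durations_s)
    # Nearest-rank p99: smallest sample with at least 99% of samples at or below it
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    return math.ceil(p99 * headroom)


# Made-up sample: 98 fast invocations and two slow ones, in seconds
samples = [1.2] * 98 + [9.5, 14.0]
timeout = suggest_timeout(samples)
```

With these samples the p99 is 9.5 seconds, so the suggested timeout is 19 seconds. The slowest outlier at 14.0 seconds would still fit, which is the point of the headroom.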
What We Do Differently Now
Everything that used to be set by clicking through the AWS console is now in Terraform. Timeout, memory, reserved concurrency, DLQ config, retry attempts — all of it. If it isn’t in code, it doesn’t exist. That’s the rule we adopted after this incident, and it’s the single change that has prevented the most regressions.
Here is the Terraform configuration that encodes all of the lessons above. The aws_lambda_function_event_invoke_config resource is the one most teams forget — it controls async retry behavior and failure destinations, and it’s a completely separate resource from aws_lambda_function. If you only define the function resource, your retry and DLQ settings are at defaults and undocumented.
# terraform/lambda.tf
# Terraform >= 1.7 | AWS provider ~> 5.0
# Shows: explicit timeout, memory, reserved concurrency, DLQ, retry config, scoped IAM
resource "aws_lambda_function" "csv_processor" {
  function_name                  = "s3-csv-processor"
  runtime                        = "python3.11"
  handler                        = "lambda_handler.handler"
  filename                       = "lambda_package.zip"
  role                           = aws_iam_role.lambda_exec.arn
  timeout                        = 30  # never leave at default 3s
  memory_size                    = 512 # minimum for CSV parsing; scales CPU allocation
  reserved_concurrent_executions = 50  # prevent burst from starving account

  dead_letter_config {
    target_arn = aws_sqs_queue.lambda_dlq.arn # failed events land here
  }

  environment {
    variables = {
      TABLE_NAME           = aws_dynamodb_table.records.name
      POWERTOOLS_LOG_LEVEL = "INFO"
    }
  }
}

# Controls async retry behavior; separate resource, easy to forget
resource "aws_lambda_function_event_invoke_config" "csv_processor" {
  function_name          = aws_lambda_function.csv_processor.function_name
  maximum_retry_attempts = 2 # explicit; default is also 2 but document it

  destination_config {
    on_failure {
      destination = aws_sqs_queue.lambda_dlq.arn
    }
  }
}

# S3 trigger: ObjectCreated:* catches Put AND CompleteMultipartUpload
# S3 also needs an aws_lambda_permission for s3.amazonaws.com to invoke the function (not shown)
resource "aws_s3_bucket_notification" "csv_upload" {
  bucket = aws_s3_bucket.uploads.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.csv_processor.arn
    events              = ["s3:ObjectCreated:*"] # NOT just s3:ObjectCreated:Put
    filter_suffix       = ".csv"                 # avoid triggering on every object type
  }
}

# IAM: scoped to specific bucket and table, no wildcards
data "aws_iam_policy_document" "lambda_policy" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.uploads.arn}/*"] # bucket-scoped, not s3:*
  }

  statement {
    actions   = ["dynamodb:PutItem"]
    resources = [aws_dynamodb_table.records.arn]
  }

  statement {
    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.lambda_dlq.arn]
  }
}
Beyond the Terraform config, we added a CloudWatch alarm on the DLQ’s ApproximateNumberOfMessagesVisible metric that fires whenever the value is greater than zero. Any message that lands in the DLQ pages the on-call engineer immediately. We also stopped using python3.8 across all our Lambda functions; it reached end-of-support in October 2024 and no longer receives managed patches. Python 3.11 is what we standardized on, and the AWS Lambda runtimes page is worth checking quarterly to stay ahead of deprecations.
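For reference, a DLQ alarm of this kind boils down to a handful of settings. The sketch below builds them as keyword arguments for CloudWatch's put_metric_alarm call and uses ApproximateNumberOfMessagesVisible, the SQS metric for messages waiting in a queue. The alarm name pattern, the period, and the queue_name and sns_topic_arn parameters are assumptions; the same values map onto whatever tool manages the alarm.

```python
def dlq_alarm_params(queue_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{queue_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",  # any message at all
        "TreatMissingData": "notBreaching",  # an idle queue is not an alarm
        "AlarmActions": [sns_topic_arn],
    }


params = dlq_alarm_params(
    "lambda-dlq", "arn:aws:sns:us-east-1:123456789012:oncall"  # made-up values
)
```

With credentials in place, the dictionary can be passed straight through as boto3.client("cloudwatch").put_metric_alarm(**params).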
The IAM role change was also non-negotiable after this incident. Our original role had s3:GetObject on *. That means a compromised Lambda function could read any object in any bucket in the account. Scoping the resource ARN to the specific source bucket is a five-minute change that closes a significant blast radius. We now enforce this in a policy check in our CI pipeline so no new Lambda function ships with wildcard S3 access.
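A check like that can be very small. The following is a minimal sketch of the idea, not our actual CI gate: it takes an IAM policy document already parsed into a dict and flags Allow statements that grant S3 actions on every resource. The function name and the sample policies are hypothetical.

```python
def wildcard_s3_statements(policy):
    """Return Allow statements that grant S3 actions on every resource."""

    def as_list(value):
        # IAM allows a single string or a list in these fields
        return value if isinstance(value, list) else [value]

    flagged = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        touches_s3 = any(a == "*" or a.lower().startswith("s3:") for a in actions)
        everything = any(r in ("*", "arn:aws:s3:::*") for r in resources)
        if touches_s3 and everything:
            flagged.append(stmt)
    return flagged


# Made-up policies: the first mirrors the original mistake, the second is scoped
too_broad = {
    "Statement": [{"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"}]
}
scoped = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::example-uploads/*"],
        }
    ]
}
broad_hits = wildcard_s3_statements(too_broad)
scoped_hits = wildcard_s3_statements(scoped)
```

A CI job can fail the build whenever the returned list is non-empty.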
If you’re building an AWS Lambda S3 trigger Python pipeline today, start with idempotency, proper error propagation, and IaC-managed configuration. Those three things would have saved us the incident entirely. The architecture itself — serverless, event-driven, pay-per-invocation — is genuinely good. It just requires the same discipline as any other production system. More, actually, because the failure modes are less visible.
More on how we structure Lambda deployments and CI gates at kuryzhev.cloud.
