How to Query ALB Access Logs with Amazon Athena for Incident Triage

The Scenario

When your ALB is throwing 502s at 2:47 AM and CloudWatch tells you nothing useful, the answers are already sitting in S3 — the challenge is getting to them fast enough to matter. This is a situation I’ve been in more than once. Production service, elevated target response times on the CloudWatch graph, no obvious root cause from the metrics side. The ALB logs Athena query path is what changed how we handle these incidents.

Here’s the setup that triggers this: ALB access logging is already enabled and files are landing in s3://your-bucket/alb-logs/AWSLogs/<account-id>/elasticloadbalancing/<region>/. But those files are gzip-compressed, named with timestamps, and spread across hundreds of prefixes. Trying to grep through them over SSH — even with zcat and parallel — burns 30-45 minutes you don’t have at 3 AM. The on-call engineer ends up either escalating blind or waiting for a metrics anomaly that may never come.

Athena solves this without any ETL pipeline, data movement, or Elasticsearch cluster. You point it at the S3 prefix, define the schema once, and run SQL. A query that filters a single day of logs from a service handling 1 million requests per day runs in under 90 seconds and costs fractions of a cent when partitioned correctly. We set this up once and it became the first tool every engineer reaches for during a 5xx incident.

Prerequisites

Before touching Athena, three things need to be in place. I’ve seen teams skip the first one and spend 20 minutes wondering why their table returns zero rows.

ALB access logging enabled. Navigate to EC2 → Load Balancers → select your ALB → Attributes → Access logs. Enable it and set the S3 bucket. Logs appear within approximately 5 minutes of traffic flowing through the load balancer. If you enable logging on a load balancer with no active traffic, the prefix won’t exist yet and Athena will error on the first query.

Athena workgroup with a results bucket. Create a workgroup (or use the default) and configure a query result location: s3://your-bucket/athena-results/. Enable workgroup-level encryption — SSE-S3 at minimum, SSE-KMS if you’re in a compliance-sensitive environment. Also set EnforceWorkGroupConfiguration: true. Without this flag, client-side settings can silently override your bytes-scanned cutoff, which matters when a junior engineer runs an unpartitioned query at 2 AM against 90 days of logs.

IAM permissions scoped tightly. The executing principal needs s3:GetObject and s3:ListBucket on the log bucket, plus athena:StartQueryExecution, athena:GetQueryResults, and glue:GetTable. Do not grant s3:GetObject to * even within the account. ALB logs contain full client IPs and complete request URIs — treat them as sensitive data. Apply an S3 bucket policy restricting access to specific IAM roles only.

You’ll also want the Glue database created before applying Terraform. You can do it manually with the AWS CLI:

aws glue create-database \
  --database-input '{"Name":"alb_logs"}' \
  --region us-east-1

Terraform will manage this going forward, but having it pre-created avoids a chicken-and-egg dependency on the first apply.

Step 1 — Create the Athena Table for ALB Logs

ALB logs Athena query illustration

This is where most teams get it wrong. The ALB log format as of 2024 has 29 fields. If you’re using a schema copied from a 2020 Stack Overflow answer, it’s missing classification, classification_reason, and conn_trace_id. The result isn’t a clean error — you get HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas, or worse, silent zero-row results. Both are painful to debug at 3 AM.

The second critical decision is partition projection. Without it, you have to run MSCK REPAIR TABLE every time new partitions appear. On a bucket with 12 months of ALB logs and more than 100,000 prefixes, that repair job takes 8–15 minutes and frequently times out. Partition projection eliminates this entirely — Athena computes partition paths dynamically based on the template you define.

Watch out for this: The LOCATION in your table definition must point to the prefix above AWSLogs/, not to a date-level prefix. Partition projection can only traverse downward from the declared root. If you point it at .../elasticloadbalancing/us-east-1/2024/03/15/, it cannot resolve any other partition. I’ve seen this mistake made three times on three different teams.

We manage the table as infrastructure using Terraform. Here’s the complete aws_glue_catalog_table resource with the full 29-column schema and partition projection configured. Use AWS provider ~> 5.0 and Terraform ~> 1.7:

# terraform/modules/alb-athena/main.tf
# Creates Glue catalog table with partition projection for ALB log analysis
# Tested with AWS provider ~> 5.0, Terraform ~> 1.7

variable "log_bucket" {
  description = "S3 bucket name where ALB access logs are stored"
  type        = string
}

variable "aws_account_id" {
  type = string
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

resource "aws_glue_catalog_database" "alb_logs" {
  name = "alb_logs"
}

resource "aws_glue_catalog_table" "alb_access_logs" {
  name          = "access_logs"
  database_name = aws_glue_catalog_database.alb_logs.name

  table_type = "EXTERNAL_TABLE"

  parameters = {
    "EXTERNAL"                        = "TRUE"
    "projection.enabled"              = "true"
    "projection.year.type"            = "integer"
    "projection.year.range"           = "2023,2030"  # extend range if table outlives 2030
    "projection.month.type"           = "integer"
    "projection.month.range"          = "1,12"
    "projection.month.digits"         = "2"
    "projection.day.type"             = "integer"
    "projection.day.range"            = "1,31"
    "projection.day.digits"           = "2"
    # Storage location template — must match ALB's actual S3 prefix structure
    "storage.location.template"       = "s3://${var.log_bucket}/AWSLogs/${var.aws_account_id}/elasticloadbalancing/${var.aws_region}/$${year}/$${month}/$${day}"
    "classification"                  = "text"
    "compressionType"                 = "gzip"
    "typeOfData"                      = "file"
  }

  storage_descriptor {
    location      = "s3://${var.log_bucket}/AWSLogs/${var.aws_account_id}/elasticloadbalancing/${var.aws_region}/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.RegexSerDe"
      parameters = {
        # Full 29-column ALB log regex — do not trim columns or schema mismatch errors occur
        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) (.*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-_]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\\s]+?)\" \"([^\\s]+)\" \"([^ ]*)\" \"([^ ]*)\""
      }
    }

    # All 29 columns in order matching the regex capture groups
    dynamic "columns" {
      for_each = [
        { name = "type",                    type = "string" },
        { name = "time",                    type = "string" },
        { name = "elb",                     type = "string" },
        { name = "client_ip",               type = "string" },
        { name = "client_port",             type = "int" },
        { name = "target_ip",               type = "string" },
        { name = "target_port",             type = "int" },
        { name = "request_processing_time", type = "double" },
        { name = "target_processing_time",  type = "double" },
        { name = "response_processing_time",type = "double" },
        { name = "elb_status_code",         type = "string" },  # string, not int — critical
        { name = "target_status_code",      type = "string" },
        { name = "received_bytes",          type = "bigint" },
        { name = "sent_bytes",              type = "bigint" },
        { name = "request_verb",            type = "string" },
        { name = "request_url",             type = "string" },
        { name = "request_proto",           type = "string" },
        { name = "user_agent",              type = "string" },
        { name = "ssl_cipher",              type = "string" },
        { name = "ssl_protocol",            type = "string" },
        { name = "target_group_arn",        type = "string" },
        { name = "trace_id",                type = "string" },
        { name = "domain_name",             type = "string" },
        { name = "chosen_cert_arn",         type = "string" },
        { name = "matched_rule_priority",   type = "string" },
        { name = "request_creation_time",   type = "string" },
        { name = "actions_executed",        type = "string" },
        { name = "redirect_url",            type = "string" },
        { name = "lambda_error_reason",     type = "string" },
        { name = "target_port_list",        type = "string" },
        { name = "target_status_code_list", type = "string" },
        { name = "classification",          type = "string" },
        { name = "classification_reason",   type = "string" },
        { name = "conn_trace_id",           type = "string" },
      ]
      content {
        name = columns.value.name
        type = columns.value.type
      }
    }
  }

  partition_keys {
    name = "year"
    type = "int"
  }
  partition_keys {
    name = "month"
    type = "int"
  }
  partition_keys {
    name = "day"
    type = "int"
  }
}

output "athena_database" {
  value = aws_glue_catalog_database.alb_logs.name
}

output "athena_table" {
  value = aws_glue_catalog_table.alb_access_logs.name
}

One note on the projection.year.range: I set it to 2023,2030. If that range is too narrow and a future date falls outside it, Athena returns zero rows with no error. Set it generously and update it as part of your annual infrastructure review.

Step 2 — Write Triage Queries for the Incident

Here’s the actual SQL we run during an active incident. Not toy examples — the real queries from our runbook, with the gotchas annotated inline.

Two things to know before running these. First, always use Athena engine version 3. Version 2 has known issues with parse_datetime on ALB’s ISO 8601 timestamp format. In engine v3, use from_iso8601_timestamp(time) — not date_parse(), which will fail silently on the microsecond component of ALB timestamps like 2024-03-15T02:47:13.123456Z.

Second, elb_status_code is a string in the raw log. Filter with elb_status_code = '502', not = 502. An integer comparison against a string column in Athena/Presto doesn’t throw an error — it just matches nothing. I spent 15 minutes on this exact bug once. Never again.

-- queries/incident_triage.sql
-- Run these in sequence during a 5xx incident triage session
-- Engine: Athena v3 | Replace date values with incident window

-- QUERY 1: Error rate by 5-minute bucket and target IP
-- Identifies which backend instance is generating errors
SELECT
  date_trunc('minute', from_iso8601_timestamp(time))
    - interval '1' minute * (minute(from_iso8601_timestamp(time)) % 5) AS bucket,
  target_ip,
  elb_status_code,
  COUNT(*) AS request_count
FROM alb_logs.access_logs
WHERE year = 2024
  AND month = 3
  AND day = 15
  AND elb_status_code IN ('502', '503', '504')  -- string comparison, not integer
GROUP BY 1, 2, 3
ORDER BY bucket DESC, request_count DESC;

-- QUERY 2: Slowest endpoints during the incident window
-- target_processing_time = -1 means no backend response; exclude from avg
SELECT
  request_url,
  COUNT(*)                                                   AS hit_count,
  ROUND(AVG(target_processing_time), 3)                     AS avg_target_s,
  ROUND(approx_percentile(target_processing_time, 0.99), 3) AS p99_target_s,
  MAX(target_processing_time)                               AS max_target_s
FROM alb_logs.access_logs
WHERE year = 2024
  AND month = 3
  AND day = 15
  AND target_processing_time != -1.0               -- exclude timed-out connections
  AND from_iso8601_timestamp(time)
      BETWEEN timestamp '2024-03-15 02:40:00 UTC'
          AND timestamp '2024-03-15 03:10:00 UTC'
GROUP BY request_url
HAVING COUNT(*) > 10
ORDER BY p99_target_s DESC
LIMIT 20;

-- QUERY 3: Top client IPs by error volume — rule out bad actor / upstream misconfiguration
SELECT
  client_ip,
  COUNT(*)                    AS total_errors,
  COUNT(DISTINCT request_url) AS distinct_endpoints,
  MIN(time)                   AS first_seen,
  MAX(time)                   AS last_seen
FROM alb_logs.access_logs
WHERE year = 2024
  AND month = 3
  AND day = 15
  AND elb_status_code IN ('502', '503', '504')
GROUP BY client_ip
ORDER BY total_errors DESC
LIMIT 15;

Query 1 tells you which backend IP started failing and when. Query 2 tells you which endpoints are slow — and note that target_processing_time is in seconds, not milliseconds. I’ve seen engineers misread a value of 4.8 as 4.8 milliseconds and conclude there’s no latency problem. It’s 4.8 seconds. That misread leads to completely wrong SLO calculations. Query 3 rules out a single bad actor — sometimes a misconfigured upstream service or a runaway client is the actual cause of the 5xx spike, not your backend at all.

Also worth noting: received_bytes and sent_bytes are -1 for failed connections. If you’re doing bandwidth analysis alongside the error triage, add WHERE received_bytes > 0 to those queries or your averages will be meaningless.

Step 3 — Optimize Queries for Cost and Speed

Athena charges $5.00 per terabyte scanned. That sounds cheap until you’re running 50 queries during a 2-hour incident bridge call. A busy ALB handling 1 million requests per day generates roughly 20–40 GB of compressed logs per month. An unpartitioned query scanning 30 days costs $0.10–$0.20 per execution. Multiply by 50 queries and you’re looking at $5–$10 in Athena costs for a single incident. More importantly, those unpartitioned queries are slow — and slow queries during an incident are unacceptable.

The fix is simple: always filter on partition columns first. A query scoped to a single day with partition projection active scans roughly 1–3 GB instead of hundreds. The cost difference is two orders of magnitude.

A few additional practices we enforce on our team:

  • Use LIMIT on all exploratory queries until you understand the data distribution.
  • Use APPROX_COUNT_DISTINCT(client_ip) instead of COUNT(DISTINCT client_ip) for cardinality estimates. Exact distinct counts in Presto/Trino are disproportionately expensive on large datasets and the approximation is accurate enough for triage purposes.
  • Set a workgroup-level bytes-scanned cutoff: BytesScannedCutoffPerQuery: 10737418240 (10 GB). This is a hard stop that prevents a full-history scan from completing. With EnforceWorkGroupConfiguration: true set, this cannot be overridden by client settings.
  • Store your triage queries in a queries/ directory in your runbook repository. On-call engineers should be able to copy-paste a query, substitute the date, and run it — not write SQL from scratch at 3 AM.

I stopped using the Athena console’s “Recent queries” history as a runbook substitute after one incident where the query a colleague had run the previous week was no longer in the history. Git-committed SQL files don’t have that problem.

Verify and Test

Set this up before the next incident, not during it. The verification process is straightforward but important.

Test with a known-good date partition. Run a simple SELECT COUNT(*) FROM alb_logs.access_logs WHERE year = 2024 AND month = 3 AND day = 14 against yesterday’s data. If the result is zero, you have a misconfiguration — either the S3 path doesn’t match the partition projection template, or the projection year range doesn’t include the current year. Zero rows with no error is the most confusing failure mode in the whole setup.

Validate column types explicitly. Run this against a small sample before relying on it in production:

-- Verify schema and timestamp parsing work correctly
SELECT
  time,
  from_iso8601_timestamp(time)   AS parsed_time,
  target_processing_time,
  received_bytes,
  elb_status_code
FROM alb_logs.access_logs
WHERE year = 2024
  AND month = 3
  AND day = 14
LIMIT 10;

If from_iso8601_timestamp(time) throws a parse error, you’re on Athena engine version 2. Upgrade the workgroup to engine version 3 in the console under Workgroup settings. If target_processing_time returns null for rows that should have values, the regex column count is off — recount your capture groups against the full 29-column regex.

Confirm the workgroup cutoff is active. Run a query without a partition filter and verify it fails with a bytes-scanned limit error rather than completing. If it completes, EnforceWorkGroupConfiguration is not set. Fix that before handing this off to the broader team.

We also include a quarterly test in our runbook: pull a specific known request from a recent date using its X-Amzn-Trace-Id header value via the trace_id column. If that query returns the expected row, the full pipeline — logging, S3 delivery, Glue schema, partition projection — is confirmed working end-to-end.

For the official ALB log format reference and field definitions, see the AWS ALB access logs documentation. For Athena partition projection configuration details, the Athena partition projection docs are the authoritative reference — especially the section on integer projection type and digit padding, which affects how month and day values are zero-padded in the S3 path template.

If you’re building out the broader observability stack around this, the monitoring and infrastructure articles on kuryzhev.cloud cover complementary patterns including alerting pipelines and log retention strategies that pair well with this Athena setup.

Once this pattern is in place, the natural next step is connecting Athena to QuickSight for persistent dashboards — error rate trends, p99 latency by endpoint, top client IPs over time — without any additional data pipeline. Another direction worth exploring is triggering Lambda-based triage queries automatically from PagerDuty webhooks: when an alert fires, a Lambda function runs the three queries above, formats the output, and posts it to the incident Slack channel before any human touches a keyboard. The ALB logs Athena query setup described here is the foundation both of those patterns build on. Get the schema right, get the partitioning right, and the rest follows.

Related

Leave a Reply

Your email address will not be published. Required fields are marked *

Support us · 💳 Monobank