WordPress php-fpm Profiling: Three Mistakes That Hid the Real Bottleneck

Context: We Had a Slow WordPress Site and Thought We Knew Why

WordPress performance profiling with php-fpm status illustration

We had php-fpm slow logging enabled for six weeks before realizing every slow request was silently disappearing into /dev/null — and php-fpm never complained once. That single config mistake cost us weeks of debugging time and a lot of misplaced confidence.

The stack was straightforward: WordPress 6.4, php-fpm 8.1, nginx 1.24, MariaDB 10.6 on a 4-core VPS with 16GB RAM. A WooCommerce storefront with moderate traffic — nothing exotic. Monitoring was New Relic free tier plus basic uptime checks from an external probe. We could see response times. We could not see why they were spiking.

The trigger was p95 response times climbing to 8 seconds or more on checkout pages during business hours. New Relic showed the slowness. It did not show where inside php-fpm the time was going. We assumed it was a database problem because that is always the first guess. It was not the database.

What we had was three compounding mistakes that together made the real bottleneck invisible for longer than I would like to admit.

Mistake 1: We Ignored php-fpm Pool Status Because “It Looked Fine”

We did have /status enabled. That part was correct. The problem was what we were actually watching.

We checked active processes versus max children. Active was usually 6–10 out of a configured max of 40. That looked healthy. We moved on.

What we were not watching was slow requests. That counter was climbing — silently, steadily — and we completely missed it because we made a fundamental misreading of what it represents.

Here is the gotcha: slow requests in the /status output is a monotonic counter since the process manager started. It does not reset on systemctl reload php8.1-fpm. Reloading resets active and idle counts, but slow requests and accepted conn survive reloads. We were reloading after every config change and assuming the counter reset. It did not. We were looking at a number that meant “total slow requests since this pool started” and treating it as “slow requests right now.”

The other field we ignored was listen queue. A non-zero listen queue means requests are waiting before php-fpm has even started processing them. That is a different problem from slow execution — it means the pool is saturated at the accept level. During our checkout spikes, listen queue was hitting 12–15. We never saw it because we never looked.

The full set of fields worth watching: idle, active, max active, slow requests, listen queue, and listen queue len. Also accepted conn as a rate proxy. Most people read two of these. You need all of them, and you need the delta on the monotonic ones — not the raw value.

Also worth knowing: pm.status_path = /status must be set per pool in /etc/php/8.1/fpm/pool.d/wordpress.conf. Setting it only in www.conf does not expose it for other pools. We had that wrong too, which meant we were watching the wrong pool’s status for a while.

Mistake 2: Slow Log Was Enabled But Pointed at /dev/null in the Wrong Pool

This one still bothers me.

We had two pools: www.conf and wordpress.conf. The wordpress.conf pool was copied from a template during initial setup. That template had slowlog = /dev/null as a placeholder. We set request_slowlog_timeout = 5s in the pool config — correctly — but we never overrode the slowlog path.

So for six weeks, every request that took more than 5 seconds generated a slow log entry that was immediately written to /dev/null and discarded. No error. No warning. No indication in /var/log/php8.1-fpm.log that anything was wrong. php-fpm 8.1 accepts /dev/null as a valid slowlog path and never complains about it.

Also worth noting: request_slowlog_timeout = 0 disables slow logging entirely, and that is the default. So if you have never explicitly set this value, silence in your slow log does not mean there are no slow requests. It means slow logging is off. These two failure modes — disabled by default, silently discarded when misconfigured — make the slow log easy to believe you have when you actually do not.

We only found this by running php-fpm -tt to dump the fully parsed config with all inherited values and grepping for the slowlog path:

php-fpm8.1 -tt 2>&1 | grep -E 'slowlog|request_slowlog_timeout'

The output showed it immediately:

[wordpress]
slowlog = /dev/null
request_slowlog_timeout = 5s

We now run that command after every deploy that touches pool config. It takes three seconds and it has already caught two more inherited-default problems since we added it to our post-deploy checklist.

Mistake 3: We Tuned pm.max_children Based on RAM Alone, Ignoring CPU Saturation

The classic formula: pm.max_children = total_RAM / avg_process_size. We ran it. We got 40. We set 40. We felt good about it.

What we did not check was CPU steal on the VPS or load average under actual WooCommerce traffic. A 4-core box with 40 php-fpm workers under a cache-miss burst does not serve 40 requests in parallel. It context-switches constantly, and every context switch adds latency to every in-flight request. We were making slow requests slower by adding more workers.

When we tested pm.max_children = 12 with pm = ondemand in the same traffic window, p99 latency dropped by roughly 40%. Fewer workers, less contention, faster individual responses.

The other problem with pm = dynamic at high max_children is thundering herd behavior on cache-miss bursts. When a cache expires and 30 requests hit php-fpm simultaneously, dynamic mode tries to spin up workers fast. Under CPU pressure that startup latency compounds the problem.

I stopped using pm = dynamic for profiling phases after this incident. pm = ondemand with pm.process_idle_timeout = 10s gives you clean per-request data because workers do not carry stale state between measurements. Profile first, then decide on pm mode. Do not tune max_children before you know what your requests are actually doing.

What We Do Differently Now

The real bottleneck, once we could actually see it, was a third-party WooCommerce plugin making 47 synchronous HTTP calls per checkout page load. Forty-seven. Each one blocking. The slow log stack traces showed wp-includes/class-http.php at the top of every 6-second entry — which almost always means a blocking external HTTP call. We would have found that in day one if the slow log had been working.

Here is the polling script we now run as a systemd service to scrape /status every 30 seconds and ship metrics to InfluxDB via Telegraf tail input. The key part is computing the delta on slow_requests rather than alerting on the raw value:

#!/usr/bin/env bash
# php-fpm-status-poller.sh
# Polls php-fpm status endpoint every 30s, extracts key metrics,
# and appends InfluxDB line protocol to a file for Telegraf tail input.
# Tested: php-fpm 8.1, nginx 1.24, Ubuntu 22.04

POOL_NAME="wordpress"
STATUS_URL="http://127.0.0.1/fpm-status?json"   # nginx must proxy this internally
OUTPUT_FILE="/var/log/php-fpm-metrics.log"
INTERVAL=30

# Baseline: capture slow_requests at script start to compute delta
PREV_SLOW=0

while true; do
  TIMESTAMP=$(date +%s%N)  # nanoseconds required for InfluxDB line protocol

  # Fetch status JSON; fail silently if fpm is restarting
  PAYLOAD=$(curl -sf --max-time 3 "${STATUS_URL}")
  if [[ -z "$PAYLOAD" ]]; then
    echo "$(date): WARNING - fpm status unreachable" >&2
    sleep "${INTERVAL}"
    continue
  fi

  # Parse fields using jq — all values are integers in fpm JSON output
  ACTIVE=$(echo "$PAYLOAD"       | jq '.["active processes"]')
  IDLE=$(echo "$PAYLOAD"         | jq '.["idle processes"]')
  LISTEN_Q=$(echo "$PAYLOAD"     | jq '.["listen queue"]')
  MAX_ACTIVE=$(echo "$PAYLOAD"   | jq '.["max active processes"]')
  SLOW_TOTAL=$(echo "$PAYLOAD"   | jq '.["slow requests"]')
  ACCEPTED=$(echo "$PAYLOAD"     | jq '.["accepted conn"]')

  # Compute delta — monotonic counter, do NOT alert on raw value
  SLOW_DELTA=$(( SLOW_TOTAL - PREV_SLOW ))
  PREV_SLOW=$SLOW_TOTAL

  # Alert condition: more than 5 slow requests in one interval is worth a page
  if (( SLOW_DELTA > 5 )); then
    echo "$(date): ALERT - ${SLOW_DELTA} slow requests in last ${INTERVAL}s (pool: ${POOL_NAME})" \
      | tee -a /var/log/php-fpm-alerts.log
    # Hook your alerting here: curl to PagerDuty, Slack webhook, etc.
  fi

  # Write InfluxDB line protocol: measurement,tags fields timestamp
  echo "phpfpm_pool,pool=${POOL_NAME},host=$(hostname) \
active=${ACTIVE}i,\
idle=${IDLE}i,\
listen_queue=${LISTEN_Q}i,\
max_active=${MAX_ACTIVE}i,\
slow_requests_total=${SLOW_TOTAL}i,\
slow_requests_delta=${SLOW_DELTA}i,\
accepted_conn=${ACCEPTED}i \
${TIMESTAMP}" >> "${OUTPUT_FILE}"

  sleep "${INTERVAL}"
done

For slow log analysis during incidents, we parse stack traces by frequency to find the most common offending files. This one-liner has been more useful than any APM dashboard during an active incident:

# Find top offending files from php-fpm slow log
# Run ad-hoc during incident — shows which scripts appear most in slow traces
grep -h "^script_filename" /var/log/php8.1-fpm-wordpress-slow.log \
  | sort | uniq -c | sort -rn | head -20

# Expected output when a plugin is the culprit:
# 47 script_filename = /var/www/html/wp-content/plugins/bad-plugin/checkout.php

The other changes: request_slowlog_timeout = 2s in production, 1s in staging. Lower thresholds catch more. The slow log is verbose at 1s but that is exactly what you want in staging — you want to see everything. And the php-fpm -tt 2>&1 | grep -E 'slowlog|timeout' check runs automatically in our post-deploy hook for any change touching pool configs.

For more on the nginx side of this stack and how we handle upstream timeouts that interact with php-fpm behavior, see the DevOps_DayS archives at kuryzhev.cloud.

Official references worth bookmarking: the php-fpm configuration documentation covers every pool directive including the ones that default to silent no-ops, and the FPM status page documentation explains exactly what each counter measures and when it resets.

The 47 HTTP calls per checkout are gone now. p95 on checkout is under 800ms. None of that required a bigger server. It required being able to actually read what php-fpm was telling us.

Leave a Reply

Your email address will not be published. Required fields are marked *

Support us · 💳 Monobank