PostgreSQL Logical Replication Across Kubernetes StatefulSets

A single deleted subscriber pod can leave an orphaned replication slot that quietly fills your primary’s disk until Postgres stops accepting writes. We found this out the hard way on a cluster that had been running fine for months, until a routine StatefulSet scale-down turned into a 2am page about disk pressure on the publisher’s PVC. That incident is why this post exists: PostgreSQL logical replication on Kubernetes has failure modes that the official docs don’t cover, because those docs assume a static host, not ephemeral pods with rescheduled IPs and reclaimed volumes.

What Logical Replication Actually Does at the Wire and Pod Level

Logical replication in Postgres works by decoding the write-ahead log (WAL) through the pgoutput plugin and streaming logical changes (inserts, updates, deletes) to a subscriber, as opposed to physical/streaming replication, which ships raw WAL blocks byte-for-byte. This distinction matters because logical replication is per-database, not per-cluster. You define a PUBLICATION on one database and a SUBSCRIPTION that consumes it; for each subscription the publisher runs a WAL sender and the subscriber runs a matching apply worker.

On the wire, the publisher holds a replication slot, a durable marker of how far the subscriber has consumed WAL, and the subscriber runs a persistent background worker process that maintains a long-lived TCP connection back to the publisher. In a Kubernetes context, that connection has to be re-established after pod restarts, node drains, and rescheduling. It shouldn’t target a hard-coded Pod IP; it goes through a Service’s DNS name, and which Service you choose determines whether replication survives a rollout or silently breaks.

This is where StatefulSet pod identity becomes non-negotiable. A headless Service gives each pod a stable DNS name like pg-publisher-0.pg-publisher-headless.prod.svc.cluster.local that persists across restarts because it’s tied to the ordinal, not the ephemeral IP. Plain Deployments don’t give you this — pod names are randomized suffixes, and there’s no guaranteed 1:1 mapping between “the pod that owns slot X” and “the pod you’re currently talking to.” That mismatch is subtle and it will bite you months after the initial setup looks fine.
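To make the naming scheme concrete, here is a minimal sketch of a shell helper that assembles the per-pod DNS name from its parts. The function name pod_fqdn is made up for this post, and cluster.local is only the default cluster domain, which your cluster may override.

```shell
# Build the stable DNS name of a StatefulSet pod behind a headless Service.
# Pattern: <statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster domain>
# pod_fqdn is a hypothetical helper; cluster.local is only the default domain.
pod_fqdn() {
  sts="$1"; ordinal="$2"; svc="$3"; ns="$4"
  domain="${5:-cluster.local}"
  printf '%s-%s.%s.%s.svc.%s\n' "$sts" "$ordinal" "$svc" "$ns" "$domain"
}

pod_fqdn pg-publisher 0 pg-publisher-headless prod
# prints pg-publisher-0.pg-publisher-headless.prod.svc.cluster.local
```

The ordinal is the only part tied to a specific pod, which is why the name stays valid when the pod is rescheduled onto a new IP.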

How People Set This Up Wrong in Kubernetes

The most common mistake I see is pointing the subscription connection string at a ClusterIP or LoadBalancer Service instead of the headless Service with ordinal DNS. It looks correct at first — replication starts, data flows, everyone moves on. Then a rolling restart happens, the ClusterIP load-balances a new connection to a different pod (maybe a read replica that isn’t even the primary), and the subscription silently stalls. No crash, no obvious error in the subscriber logs — just growing lag that nobody notices until a report shows stale data.

Second mistake: not setting wal_level = logical before the first StatefulSet rollout. This parameter requires a full Postgres restart, not a config reload. If you add it to the ConfigMap after pods are already running, nothing changes until each pod’s Postgres process restarts with the new file: a ConfigMap mounted via subPath is not refreshed inside a running pod, and a reload cannot apply wal_level anyway. A kubectl rollout restart does recreate the pods, but in HA setups the restart order still matters.
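One way to check whether a running publisher has actually picked up the setting is to read it back from pg_settings, which also exposes a pending_restart flag. The sketch below only interprets one line of psql output; the function name wal_level_status and its three result strings are invented for illustration.

```shell
# Interpret one "setting|pending_restart" line, as produced by:
#   psql -tA -c "SELECT setting, pending_restart FROM pg_settings WHERE name = 'wal_level'"
# wal_level_status and its result strings are hypothetical names.
wal_level_status() {
  IFS='|' read -r setting pending
  if [ "$setting" = "logical" ] && [ "$pending" = "f" ]; then
    echo "ok"
  elif [ "$pending" = "t" ]; then
    echo "restart-pending"   # config file changed, server not restarted yet
  else
    echo "not-logical"       # running value is still replica or minimal
  fi
}
```

Piping the psql command from the comment into wal_level_status gives a single word you can branch on in a rollout script.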

Third, and this is the one that caused our incident: ignoring orphaned replication slots. When a subscriber pod and its PVC get deleted (during a scale-down, a namespace migration, whatever), the slot on the publisher doesn’t get cleaned up automatically. Postgres keeps the WAL around indefinitely because the slot says “I haven’t consumed this yet,” even though nobody is ever going to consume it again. By default WAL accumulates without bound until the publisher’s PVC fills and writes fail cluster-wide. PostgreSQL 13+ can cap retention with max_slot_wal_keep_size, but it defaults to unlimited.

Fourth, people assume PodDisruptionBudgets or StatefulSet ordinal guarantees somehow protect replication state. They don’t. PDBs protect availability during voluntary disruptions; they say nothing about application-level state like a stale replication slot. Slot lifecycle is entirely your responsibility.

The Correct Approach — Wiring Replication Across StatefulSets

On the publisher StatefulSet, three parameters matter most: wal_level = logical, max_replication_slots sized to your subscriber count plus a buffer, and max_wal_senders sized to cover both logical slots and any physical replicas you’re running for HA. Undersize max_wal_senders and new connections fail with FATAL: number of requested standby connections exceeds max_wal_senders; undersize max_replication_slots and slot creation fails with ERROR: all replication slots are in use. Both tend to surface exactly when you scale up subscribers.
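The sizing rule is simple arithmetic. Here is a worked sketch in which the three input numbers are assumptions, not measurements from our cluster. Initial table synchronization uses additional temporary slots, so the buffer should not be zero.

```shell
# All three inputs are assumed example values; plug in your own.
subscribers=8         # logical subscribers expected to connect
spare_slots=2         # buffer, also absorbs temporary table-sync slots
physical_replicas=5   # HA standbys and base-backup streams

max_replication_slots=$(( subscribers + spare_slots ))
max_wal_senders=$(( max_replication_slots + physical_replicas ))

echo "max_replication_slots = $max_replication_slots"
echo "max_wal_senders = $max_wal_senders"
```

With these inputs the script prints max_replication_slots = 10 and max_wal_senders = 15.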

Networking should always go through a headless Service (clusterIP: None) with the subscriber connecting via the stable ordinal DNS pattern: host=pg-publisher-0.pg-publisher-headless.prod.svc.cluster.local. Use a dedicated replication role with the REPLICATION attribute, LOGIN, and SELECT on the published tables (the initial copy reads them), never superuser. It’s a least-privilege boundary that costs nothing to set up and limits the damage when credentials leak.
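A minimal sketch of the publisher-side SQL for such a role follows, wrapped in a shell function so it can be piped into psql. The function name publisher_setup_sql, the public schema, and the FOR ALL TABLES publication are assumptions; the role and publication names match the ones used elsewhere in this post. The role needs LOGIN and REPLICATION to stream changes, plus read access to the tables for the initial copy.

```shell
# Print the SQL that creates the replication role and the publication.
# publisher_setup_sql is a hypothetical helper; the public schema and
# FOR ALL TABLES are assumptions about your database layout.
publisher_setup_sql() {
  cat <<'SQL'
CREATE ROLE repl_user WITH LOGIN REPLICATION PASSWORD :'repl_password';
GRANT USAGE ON SCHEMA public TO repl_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repl_user;
CREATE PUBLICATION pub_app_data FOR ALL TABLES;
SQL
}

# Usage (the password arrives as a psql variable, never inline in the SQL):
#   publisher_setup_sql | psql -v ON_ERROR_STOP=1 \
#     -v repl_password="$REPL_PASSWORD" -U postgres -d app
```

Passing the password through -v keeps it out of the script itself; psql substitutes :'repl_password' as a quoted literal when it reads the statements from stdin.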

Here’s the publisher-side StatefulSet configuration we run in production, with the config baked in before first rollout:

apiVersion: v1
kind: Service
metadata:
  name: pg-publisher-headless
  namespace: prod
spec:
  clusterIP: None          # headless — required for stable per-pod DNS
  selector:
    app: pg-publisher
  ports:
    - port: 5432
      name: postgres
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-publisher-conf
  namespace: prod
data:
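  postgresql.conf: |
    listen_addresses = '*'            # a custom config_file otherwise listens on localhost only
    wal_level = logical               # requires full restart, not reload
    max_replication_slots = 10        # size for subscribers + buffer
    max_wal_senders = 15              # slots + physical replicas
    max_wal_size = 2GB
    hot_standby_feedback = on         # only takes effect on standbys
  pg_hba.conf: |
    # local socket access, needed by the image's init scripts
    local all all trust
    # hostssl rules only match once ssl = on and server certs are configured (not shown)
    # "replication" covers physical replication connections only
    hostssl replication repl_user 10.244.0.0/16 scram-sha-256
    # logical replication connects to a regular database and matches here
    hostssl all repl_user 10.244.0.0/16 scram-sha-256
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg-publisher
  namespace: prod
spec:
  serviceName: pg-publisher-headless
  replicas: 1                          # logical publisher = single writer node
  selector:
    matchLabels:
      app: pg-publisher
  template:
    metadata:
      labels:
        app: pg-publisher
    spec:
      containers:
        - name: postgres
          image: postgres:15.4
          env:
            - name: POSTGRES_PASSWORD  # the image will not initialize an empty data dir without it
              valueFrom:
                secretKeyRef:
                  name: pg-publisher-secret   # assumed Secret name, created separately
                  key: postgres-password
            - name: PGDATA             # subdirectory avoids initdb tripping over lost+found
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
            - name: conf
              mountPath: /etc/postgresql/postgresql.conf
              subPath: postgresql.conf
            - name: conf               # without this mount the pg_hba.conf above is never used
              mountPath: /etc/postgresql/pg_hba.conf
              subPath: pg_hba.conf
          args: ["-c", "config_file=/etc/postgresql/postgresql.conf", "-c", "hba_file=/etc/postgresql/pg_hba.conf"]
      volumes:
        - name: conf
          configMap:
            name: pg-publisher-conf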
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi   # size with WAL retention headroom, ~20-30% buffer

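Postgres has no native CREATE SUBSCRIPTION IF NOT EXISTS, so subscription setup needs to be idempotent by hand, usually in a Job or post-start step that runs once the subscriber’s Postgres accepts connections, never a manual psql session that nobody remembers running. An init container runs before the Postgres container starts, so it cannot reach the local server. Slot cleanup belongs to decommissioning: if you tie it to a PreStop hook, remember that the hook fires on every pod termination, including routine restarts, and set terminationGracePeriodSeconds long enough (30–60s) for the drop to actually complete before the pod dies. The creation half looks like this: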

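# Idempotent subscription creation. Run it from a Job or post-start step,
# after the subscriber's Postgres accepts connections; it targets the
# publisher's stable pod DNS.

PUBLISHER_HOST="pg-publisher-0.pg-publisher-headless.prod.svc.cluster.local"
# Assumed DNS name of the subscriber pod; adjust to your StatefulSet.
SUBSCRIBER_HOST="pg-subscriber-0.pg-subscriber-headless.prod.svc.cluster.local"

# CREATE SUBSCRIPTION needs superuser (or pg_create_subscription on PG16+)
# on the subscriber, so connect as an admin role there, not as repl_user.
run_psql() {
  psql -v ON_ERROR_STOP=1 -h "$SUBSCRIBER_HOST" -U postgres -d app "$@"
}

# With create_slot = true, CREATE SUBSCRIPTION cannot run inside a DO block
# or a transaction, so check first and issue it as a top-level statement.
exists=$(run_psql -tAc "SELECT 1 FROM pg_subscription WHERE subname = 'sub_app_data'") || exit 1

if [ "$exists" != "1" ]; then
  # The publisher password is not in the connection string; supply it via
  # password=... or a passfile readable by the subscriber's server process.
  run_psql <<SQL || exit 1
CREATE SUBSCRIPTION sub_app_data
  CONNECTION 'host=${PUBLISHER_HOST} port=5432 dbname=app user=repl_user sslmode=verify-full'
  PUBLICATION pub_app_data
  WITH (copy_data = true, create_slot = true, slot_name = 'slot_subscriber_0');
SQL
fi

# Verify lag and slot health from the publisher side afterward:
# SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
# FROM pg_replication_slots WHERE slot_name = 'slot_subscriber_0';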

If you see ERROR: could not connect to the publisher: server closed the connection unexpectedly, it's almost always a stale headless Service DNS cache after a pod got rescheduled. Check whether the ordinal you're connecting to is actually the pod you think it is. If CREATE PUBLICATION warns that wal_level is insufficient to publish logical changes, or slot creation fails with ERROR: logical decoding requires wal_level >= logical, someone shipped a config change without the required restart on the publisher.
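For the cleanup half, here is a hedged sketch of the SQL a decommission step could run when a subscriber is going away for good. The three helper names are invented for this post. The second helper follows the documented fallback for an unreachable publisher: disable the subscription, detach its slot, then drop it.

```shell
# Hypothetical decommission helpers; each one only prints SQL.

# Run on the SUBSCRIBER. DROP SUBSCRIPTION also drops the remote slot,
# provided the publisher is reachable.
drop_subscription_sql() {
  printf 'DROP SUBSCRIPTION IF EXISTS %s;\n' "$1"
}

# Run on the SUBSCRIBER when the publisher is unreachable: detach the slot
# first so the drop does not try to contact the publisher.
detach_and_drop_sql() {
  sub="$1"
  printf 'ALTER SUBSCRIPTION %s DISABLE;\n' "$sub"
  printf 'ALTER SUBSCRIPTION %s SET (slot_name = NONE);\n' "$sub"
  printf 'DROP SUBSCRIPTION %s;\n' "$sub"
}

# Run on the PUBLISHER if the slot was left behind. It fails while the
# slot is still active, which is a useful safety check.
drop_slot_sql() {
  printf "SELECT pg_drop_replication_slot('%s');\n" "$1"
}

# Usage: drop_subscription_sql sub_app_data | psql -v ON_ERROR_STOP=1 -U postgres -d app
```

Keeping the SQL generation separate from the psql call makes it easy to log exactly what a hook or Job is about to run before it runs it.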

Advanced Patterns — Multi-Cluster and Failover-Aware Replication

Once you're past a single publisher/subscriber pair, things get more interesting. Cross-namespace or cross-cluster replication typically bridges StatefulSets through ExternalName Services, or through a mesh like Cilium or Istio, which lets you enforce mTLS at the network layer instead of relying purely on pg_hba.conf entries. That is a good idea, because replication traffic inside the cluster network is plain, unencrypted TCP unless you enable TLS on the server, force sslmode=verify-full on the subscriber, and issue certs via cert-manager.

Failover is the part people underestimate. If your publisher is managed by Patroni, replication slots are not part of Patroni's DCS-synced state by default. When the primary fails over, the new primary doesn't automatically inherit the old slot's position — subscriptions need to be re-pointed to the new primary's Service endpoint, and slots need to be re-created or restored via pg_failover_slots or a callback script triggered by Patroni's failover hooks. Skip this and you get silent data gaps that look like replication lag but are actually missing transactions.

PostgreSQL 15+ also supports row filters and column lists on publications: a WHERE clause limits which rows get replicated, and a column list limits which columns. In multi-tenant sharded setups, this is genuinely useful: you can replicate only tenant-specific rows between StatefulSets without building a custom CDC pipeline. Initial sync parallelism is controlled by max_sync_workers_per_subscription, which has existed since PG10. PG16 adds parallel apply of large in-progress transactions (streaming = parallel, bounded by max_parallel_apply_workers_per_subscription) and a binary-format initial copy, which can meaningfully speed up big table syncs and is worth the upgrade if you're still on 15.x.
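As an illustration of a row filter, the sketch below generates a per-tenant CREATE PUBLICATION statement. The table name orders, the column tenant_id, and the helper name are all hypothetical. If the publication also publishes UPDATE or DELETE, the columns used in the filter must be covered by the table's replica identity.

```shell
# Emit a row-filtered publication for one tenant (PostgreSQL 15+ syntax).
# orders, tenant_id and tenant_publication_sql are hypothetical names.
tenant_publication_sql() {
  tenant_id="$1"
  case "$tenant_id" in
    ''|*[!0-9]*) echo "tenant id must be numeric" >&2; return 1 ;;
  esac
  printf 'CREATE PUBLICATION pub_tenant_%s FOR TABLE orders WHERE (tenant_id = %s);\n' \
    "$tenant_id" "$tenant_id"
}

tenant_publication_sql 42
# prints: CREATE PUBLICATION pub_tenant_42 FOR TABLE orders WHERE (tenant_id = 42);
```

The numeric check is there because the value is spliced into SQL text; anything that is not a plain number is rejected before a statement is produced.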

Performance Notes — What Actually Costs You

WAL decoding overhead scales with the number of subscriptions, not just with row volume. Each active slot is served by its own walsender process that decodes the WAL stream, and if you stack 20+ subscriptions against one publisher pod, you'll saturate CPU limits before you ever come close to network bandwidth limits. Size your publisher's CPU requests with that in mind: it's a CPU problem, not a bandwidth problem.

Storage is the sneaky cost. A dead or lagging slot keeps WAL pinned on the publisher's PVC, and on high-write clusters that can grow tens of gigabytes per day — directly hitting your EBS gp3 or PD-SSD billing. Plan PVC size with 20-30% headroom above expected WAL volume, and alert on pg_replication_slots.active = false before it becomes a disk-full incident, not after.

Initial table sync is the other cost trap. The COPY that runs on subscription creation has no throttling knob: it's a full-table read at whatever speed the network allows. For large tables, this can saturate pod-to-pod bandwidth and, depending on your CNI's network policy enforcement, actually trigger rate limiting that makes the sync look stuck rather than just slow. We've hit this specifically with Cilium 1.14 enforcing strict per-pod bandwidth policies; the fix was temporarily relaxing the policy during initial sync windows, not debugging Postgres.

Monitor lag proactively with a simple query against pg_replication_slots:

SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
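To turn that query into an alert, a small filter over its output is enough. This is a sketch, not our production check: check_slots is a made-up name, and the threshold is whatever your PVC headroom allows. It expects the query's rows in psql -tA format (slot_name|active|lag_bytes) on stdin.

```shell
# Print one line per problem slot and exit 1 if anything was flagged.
# Input: "slot_name|active|lag_bytes" rows, e.g. from psql -tA.
# check_slots is a hypothetical helper name.
check_slots() {
  max_lag_bytes="$1"
  awk -F'|' -v limit="$max_lag_bytes" '
    BEGIN              { bad = 0 }
    NF == 0            { next }
    $2 != "t"          { printf "INACTIVE %s\n", $1; bad = 1; next }
    $3 + 0 > limit + 0 { printf "LAGGING %s %s bytes\n", $1, $3; bad = 1 }
    END                { exit bad }
  '
}

# Usage, with a 1 GiB threshold:
#   psql -tA -c "SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) FROM pg_replication_slots" \
#     | check_slots 1073741824
```

The non-zero exit status makes it usable from a CronJob or an exec probe without extra parsing.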

If you're running Postgres on Kubernetes at any real scale, treat the wiring for logical replication as an ongoing operator responsibility, not a one-time setup task. Slots don't clean themselves up, failovers don't re-point subscriptions automatically, and the official docs won't warn you about any of it. We've covered more Kubernetes operational patterns like this over at kuryzhev.cloud if you want the broader context on running stateful workloads reliably.

For the underlying mechanics referenced here, the PostgreSQL logical replication documentation and the Kubernetes StatefulSet docs are worth keeping open in a tab while you build this out.
