Kubernetes Bash Health Check Scripts vs Native Probes

When Your Pod Lifecycle Becomes a Reliability Problem

kubernetes bash health check scripts illustration

Your pod shows Ready, Kubernetes is routing traffic to it, and your users are getting 500 errors — because the default httpGet probe has no idea your database connection pool just exhausted. This is the exact scenario that forces you to seriously evaluate kubernetes bash health check scripts against native probe types. I’ve hit this wall more than once in production, and the answer is never “just use one or the other.”

Here’s what actually happens: Kubernetes evaluates livenessProbe, readinessProbe, and startupProbe through the kubelet on the node — not the API server. That’s important because it means network partitions between nodes and the control plane don’t affect probe execution. But kubelet is also limited to what you tell it to check. A tcpSocket probe verifies the port is open. Full stop. It has no idea whether the goroutine handling that port is in a deadlock. An httpGet probe returns healthy on any 2xx or 3xx — so a degraded endpoint returning HTTP 200 with {"status":"degraded"} in the body passes silently. Your SLO burns while Kubernetes happily routes traffic.

The decision point is real: lightweight native probes versus expressive custom bash scripts. Both are valid. Neither is universally correct. Let me break down exactly where each one wins and where it falls apart.


Option A — Native Kubernetes Probes (exec/httpGet/tcpSocket)

Native probes are what Kubernetes ships with: httpGet, tcpSocket, exec with a command array, and grpc (stable since 1.27). Before you write a single line of bash, you should understand what these actually give you — and where they silently fail you.

Pros:

Zero dependencies. This is the big one. If you’re running distroless images — gcr.io/distroless/base, scratch-based Go binaries, minimal Python images — there is no /bin/sh. There is no curl. An httpGet probe is handled directly by the kubelet process without forking anything into the container. It just works. At scale — 1,000+ pods — this matters. Each exec probe spawns a child process via fork() + exec(). On a node running 50 pods at periodSeconds: 5, bash exec probes generate roughly 600 process spawns per minute. On a t3.small with constrained CPU, that’s measurable noise.

Declarative and auditable. The probe config lives entirely in the Pod spec YAML. kubectl describe pod <name> shows everything — thresholds, last result, failure reason under Events:. No external script to track, no version drift between what’s in the image and what’s deployed.

Cons:

No conditional logic. You cannot express “is the DB reachable AND has the migration completed AND is the queue depth below 500” in a single native probe. You get one check per probe type.

Body-blind HTTP checks. httpGet passes on any 2xx/3xx. A degraded endpoint that your app team decorated with a 200 status code for “graceful degradation” will pass your readiness probe and keep receiving traffic. I’ve seen this cause 20-minute incidents that took forever to diagnose because Kubernetes showed all pods green.

Version dependencies. The grpc probe type requires Kubernetes 1.24+ and the application to implement grpc.health.v1. In mixed clusters or managed services that lag on minor versions, this bites you. Teams often discover this at deploy time, not planning time.

Watch out for: The default timeoutSeconds: 1 is the most commonly misconfigured field I see in production manifests. A single slow response from a JVM app during GC pause exceeds 1 second easily. Combine that with failureThreshold: 1 — which I also see regularly — and you get a pod that restarts on every minor hiccup. Set failureThreshold to at minimum 3, and 5 for JVM workloads.


Option B — Custom Bash Health Check Scripts

Writing your own exec-based bash scripts for probes means you control exactly what “healthy” means. The contract is simple: exit 0 means healthy, exit 1 means not healthy. Kubelet calls your script, reads the exit code, and acts accordingly. That simplicity is deceptive — there’s real complexity hiding in the details.

Pros:

Full conditional logic. Check HTTP response body content with curl -sf plus grep or jq. Verify process existence with pgrep. Test database connectivity with pg_isready (ships with postgresql-client) or test Redis with redis-cli PING and compare the response string. All of this in one script, composable, testable locally before it ever touches a cluster. You can even pass a $1 argument to switch between readiness and liveness behavior from the same script, reducing duplication across your spec.

Easy local testing. Run the script directly on your laptop against a local Docker Compose stack. Catch bugs before they become pod-restart bugs in staging. This feedback loop is genuinely faster than iterating on native probe YAML.

Cons:

Binary bloat and attack surface. Adding curl, jq, pg_isready, redis-cli to a minimal image increases image size and attack surface. On a distroless or scratch image, this is simply impossible without a sidecar pattern. Security teams will flag this in image scans, and they’re not wrong to do so.

Script bugs become pod-kill bugs. A bash syntax error, an unhandled edge case, or a missing exit 0 on the success path causes exit 1 at probe time. Kubelet doesn’t know the difference between “app is unhealthy” and “your script has a bug.” Both trigger restarts. Always use set -euo pipefail at the top of probe scripts, and always explicitly write exit 0 on success — relying on the implicit exit code of the last command is fragile when that last command is an echo or a variable assignment.

Watch out for: A missing shebang (#!/bin/bash or #!/bin/sh) causes exec format error in pod events. It’s one of the most confusing failures to debug because the error message doesn’t tell you the script is the problem — it just says the exec failed. Also: scripts stored in ConfigMaps are not encrypted at rest by default. Never put credentials in a ConfigMap-mounted health script. Reference $DB_PASSWORD from environment variables backed by Secrets instead.

Here’s the readiness script I use in production for multi-dependency applications. It’s mounted via ConfigMap so it can be updated without rebuilding the image:

#!/bin/bash
# readiness-check.sh
# Mounted via ConfigMap at /scripts/health/readiness-check.sh
# Used as exec readiness probe — exit 0 = ready, exit 1 = not ready
# Requires: curl, pg_isready, redis-cli in container image

set -euo pipefail

# ── Configuration from environment variables ──────────────────────────────────
APP_PORT="${APP_PORT:-8080}"
DB_HOST="${DB_HOST:-localhost}"
DB_PORT="${DB_PORT:-5432}"
DB_USER="${DB_USER:-app}"
REDIS_HOST="${REDIS_HOST:-localhost}"
REDIS_PORT="${REDIS_PORT:-6379}"
HEALTH_ENDPOINT="${HEALTH_ENDPOINT:-/internal/health}"
CURL_TIMEOUT="${CURL_TIMEOUT:-3}"

# ── Helper: log to stderr (visible in kubectl describe pod events) ─────────────
log_fail() {
  echo "[READINESS FAIL] $1" >&2
  exit 1
}

# ── Check 1: Application HTTP endpoint responds 200 with expected body ─────────
HTTP_RESPONSE=$(curl \
  --silent \
  --fail \
  --max-time "${CURL_TIMEOUT}" \
  --write-out "%{http_code}" \
  --output /tmp/health_body \
  "http://localhost:${APP_PORT}${HEALTH_ENDPOINT}" 2>/dev/null) || \
  log_fail "HTTP probe failed — curl exit code $? (port ${APP_PORT} unreachable or timeout)"

if [ "${HTTP_RESPONSE}" != "200" ]; then
  log_fail "HTTP probe returned ${HTTP_RESPONSE}, expected 200"
fi

# ── Check 2: Validate response body contains status:ok (not just 200) ──────────
# Prevents false-positive ready state from degraded-but-responding endpoints
BODY_STATUS=$(cat /tmp/health_body | grep -o '"status":"ok"' || true)
if [ -z "${BODY_STATUS}" ]; then
  log_fail "HTTP body missing '\"status\":\"ok\"' — app degraded: $(cat /tmp/health_body | head -c 200)"
fi

# ── Check 3: PostgreSQL reachability (not just port open) ──────────────────────
pg_isready \
  --host="${DB_HOST}" \
  --port="${DB_PORT}" \
  --username="${DB_USER}" \
  --timeout=2 \
  --quiet || log_fail "PostgreSQL not ready at ${DB_HOST}:${DB_PORT}"

# ── Check 4: Redis PING/PONG ───────────────────────────────────────────────────
REDIS_RESPONSE=$(redis-cli \
  -h "${REDIS_HOST}" \
  -p "${REDIS_PORT}" \
  --no-auth-warning \
  PING 2>/dev/null || echo "ERROR")

# redis-cli returns the string "PONG" — must compare explicitly, not just check exit code
if [ "${REDIS_RESPONSE}" != "PONG" ]; then
  log_fail "Redis not responding at ${REDIS_HOST}:${REDIS_PORT} — got: ${REDIS_RESPONSE}"
fi

# ── All checks passed ──────────────────────────────────────────────────────────
echo "[READINESS OK] app=200+body db=ready cache=PONG" >&2
exit 0

Decision Matrix — Choosing the Right Probe Strategy

Stop treating this as a binary choice. The real question is which probe type serves which concern best. Here’s the framework I apply on every new service:

Condition Native Only Bash Script Hybrid (Recommended)
Distroless / scratch image Native only
Single HTTP check needed Overkill Native only
Multi-dependency (DB + cache + queue) Bash readiness + native liveness
High pod density (>50/node) Watch CPU cost Native liveness, bash readiness
Kubernetes < 1.24 No grpc probe Bash for gRPC health
Slow-starting JVM / migration ✅ startupProbe Risky during init Native startup + bash readiness
Team bash proficiency low Risk of script bugs Native only until team is ready

The hybrid pattern that I reach for most often in production: native httpGet for liveness, custom bash exec for readiness, and a generous native httpGet startup probe to handle slow initialization. This combination is not a compromise — it’s the right tool for each job. Liveness controls restarts. A false positive liveness failure is catastrophic — it kills a pod that may have been processing a transaction. Keep it simple. Readiness controls traffic routing via the Endpoints object. Worth the bash overhead to get it exactly right.

You can find more patterns for managing Kubernetes workloads reliably at kuryzhev.cloud.


My Pick — Bash Readiness + Native Liveness, Every Time

I prefer the hybrid approach, and I’ll tell you exactly why in terms a code reviewer would use.

Readiness controls whether a pod receives traffic from the Service. When a readiness probe fails, Kubernetes removes the pod’s IP from the Endpoints object. Traffic stops going to it. This is the probe where you want richness — check the database, check the cache, check that the response body actually says the app is healthy, not just that port 8080 is open. The bash overhead is worth it here. Getting readiness wrong means users hit broken pods. I will take 600 extra process forks per minute on a node over a single bad deploy that serves errors for 10 minutes while all pods show green.

Liveness controls pod restarts. A false positive — a probe that reports unhealthy when the app is actually fine — causes an unnecessary restart. During high load, when GC pauses are longer, when the app is slow but functional, a too-aggressive liveness probe cascades into a restart loop that makes the problem worse. Keep liveness simple. Keep it native. A single /internal/ping endpoint that returns 200 if the process is alive is all you need. No DB check. No cache check. Those belong in readiness.

The startup probe changed how I think about initialization. Before Kubernetes 1.18 added startupProbe, teams hacked around slow-starting apps by setting huge initialDelaySeconds on liveness — which meant a genuinely crashed app wouldn’t be detected for minutes. Now I use a startup probe with failureThreshold: 30 and periodSeconds: 10 — that’s a 5-minute window for the app to become ready. Once it passes, the startup probe stops running and liveness takes over with tight thresholds. Clean separation of concerns.

Here’s the complete deployment configuration that implements this hybrid pattern. The ConfigMap mount for the script means I can update the health check logic without rebuilding the Docker image — a real operational advantage when you need to tune probe behavior in response to an incident:

# pod-probe-config.yaml
# Hybrid pattern: bash exec readiness + native httpGet liveness
# Kubernetes 1.27+ for grpc probe type and per-probe terminationGracePeriodSeconds
# ConfigMap mounts script without baking it into the image

apiVersion: v1
kind: ConfigMap
metadata:
  name: health-scripts
  namespace: production
data:
  readiness-check.sh: |
    #!/bin/bash
    # Full script content from readiness-check.sh above
    # Use kustomize secretGenerator or helm .Files.Get to inline full script in practice

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  template:
    spec:
      volumes:
        - name: health-scripts
          configMap:
            name: health-scripts
            defaultMode: 0755          # ensures chmod +x — missing this causes exec format error

      containers:
        - name: api
          image: myrepo/api:1.4.2
          env:
            - name: APP_PORT
              value: "8080"
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials  # secrets NOT in ConfigMap — never put creds there
                  key: host

          volumeMounts:
            - name: health-scripts
              mountPath: /scripts/health
              readOnly: true

          # Startup probe: native httpGet, generous thresholds for slow JVM/migration init
          startupProbe:
            httpGet:
              path: /internal/ping      # lightest possible endpoint — no DB check
              port: 8080
            failureThreshold: 30        # 30 * 10s = 5 minutes max startup window
            periodSeconds: 10
            timeoutSeconds: 3

          # Liveness probe: native httpGet — simple, no subprocess, no false positives
          livenessProbe:
            httpGet:
              path: /internal/ping
              port: 8080
            initialDelaySeconds: 0      # startupProbe gates this — no delay needed
            periodSeconds: 15
            timeoutSeconds: 3
            failureThreshold: 3         # 3 failures = 45s before restart — never set to 1
            successThreshold: 1         # must be 1 for liveness — any other value = apply error

          # Readiness probe: custom bash script — rich multi-dependency logic
          readinessProbe:
            exec:
              command:
                - /scripts/health/readiness-check.sh
            periodSeconds: 10
            timeoutSeconds: 5           # must exceed CURL_TIMEOUT (3s) + pg_isready timeout (2s)
            failureThreshold: 2         # stricter than liveness — pull from LB faster
            successThreshold: 1

One last thing worth calling out: successThreshold must be exactly 1 for liveness and startup probes. If you set it to 2 thinking “I want two consecutive successes before marking alive,” kubectl apply will reject it with spec.containers[0].livenessProbe.successThreshold: Invalid value: 2. Only readiness probes support successThreshold > 1, which can be useful for flapping detection.

For the official reference on probe configuration fields and behavior, the Kubernetes documentation on liveness, readiness, and startup probes covers the full spec. For gRPC health checking specifically, the container probes lifecycle page explains the grpc probe type and version requirements.

The bottom line: use kubernetes bash health check scripts for readiness where logic complexity justifies the overhead, keep liveness native and lenient, and let the startup probe do the heavy lifting during initialization. Your on-call rotation will thank you.

Related

Leave a Reply

Your email address will not be published. Required fields are marked *

Support us · 💳 Monobank