Automate Cloudflare DNS Failover with Ansible During Incidents

Your origin goes down at 2am, and your “automated” Cloudflare DNS failover is actually a Slack message asking someone to run a playbook manually. That someone is asleep. Their vault credentials are on a laptop that’s also asleep. By the time the record gets swapped, you’ve lost 20 minutes of traffic and your SLO is on the floor. I’ve seen this exact scenario play out three times across different teams. The fix isn’t complicated, but it requires understanding what Cloudflare DNS failover actually does before you write a single line of Ansible.

What Cloudflare DNS Failover Actually Does

Here’s the first thing most tutorials get wrong: Cloudflare DNS failover is not automatic unless you’re using their paid Load Balancing add-on. That product (priced from $5/month) gives you health checks, origin pools, and automatic record switching. What we’re building here is something different: programmatic DNS record manipulation via the Cloudflare API, triggered by your monitoring stack. These are two completely different mechanisms, and conflating them is how teams end up with false confidence in their runbooks.

When you update an A record via the Cloudflare API on a proxied hostname, public DNS keeps answering with Cloudflare’s edge addresses. What changes is the origin the edge forwards to, and that takes effect almost immediately because no resolver ever cached your origin IP. Resolver caching only enters the picture for DNS-only (unproxied) records. There, ttl: 1 means Auto, which Cloudflare documents as 300 seconds, so an external resolver may keep serving the old value for up to the TTL that was in effect when it last queried. Pre-staging your failover record matters enormously in both cases. If the backup A record has existed for weeks, it has been exercised end to end and no resolver is holding a cached “no such name” answer for it. If you create it during the incident, you’re starting cold.
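
To put a number on the caching window for a DNS-only record, here is a small sketch of the worst case. It assumes Auto TTL (ttl: 1) maps to 300 seconds; the function names are mine, not part of any Cloudflare or Ansible API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Cloudflare's "Auto" TTL (ttl: 1) corresponds to 300 seconds.
AUTO_TTL_SECONDS = 300


def stale_until(last_query: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a resolver may still serve the answer it cached at last_query."""
    effective = AUTO_TTL_SECONDS if ttl_seconds == 1 else ttl_seconds
    return last_query + timedelta(seconds=effective)


def worst_case_stale_seconds(swap_time: datetime, last_query: datetime, ttl_seconds: int) -> float:
    """Seconds after the swap during which that resolver can still return the old IP."""
    remaining = (stale_until(last_query, ttl_seconds) - swap_time).total_seconds()
    return max(0.0, remaining)


if __name__ == "__main__":
    swap = datetime(2024, 1, 1, 2, 0, 0, tzinfo=timezone.utc)
    # A resolver that refreshed 10 seconds before the swap, record on Auto TTL.
    print(worst_case_stale_seconds(swap, swap - timedelta(seconds=10), 1))  # 290.0
```

The point of the arithmetic: the TTL that counts is the one a resolver saw on its last query, not the one you set after the alert fired.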

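The playbooks here talk to the Cloudflare REST API directly through ansible.builtin.uri, which ships with ansible-core. The community.general collection provides the cloudflare_dns module discussed below, and community.dns provides DNS lookup plugins for checking what resolvers actually return. Install them with:

ansible-galaxy collection install community.general
# Quote the version spec, otherwise the shell treats ">" as a redirect
ansible-galaxy collection install 'community.dns:>=2.5.0'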
# Also required on the control node:
pip install "dnspython==2.4.2"
# Pin to 2.4.2 — version 2.5.0 introduced a breaking timeout regression
# that causes community.dns.lookup to hang without an explicit timeout param

One critical internal detail: Cloudflare’s API always returns ttl: 1 for proxied records regardless of what value you set. Do not use the API-returned TTL as a health signal in your pre-flight checks. It will always be 1 and tells you nothing about propagation state.
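
A pre-flight comparison along these lines keys off the record’s content and ignores ttl entirely. This is only a sketch: the dict mirrors the fields a Cloudflare DNS record object carries (type, content, ttl, proxied), and the function name is made up for illustration.

```python
def record_points_at(record: dict, expected_ip: str) -> bool:
    """True when an A record's content is expected_ip. The ttl field is never consulted."""
    return record.get("type") == "A" and record.get("content") == expected_ip


if __name__ == "__main__":
    # Example record shaped like one entry of the API's "result" list (values are illustrative).
    record = {"type": "A", "content": "203.0.113.50", "ttl": 1, "proxied": True}
    print(record_points_at(record, "203.0.113.50"))   # True
    print(record_points_at(record, "198.51.100.10"))  # False
```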

How Teams Get This Wrong in Production

I’ve audited a lot of “DNS failover runbooks” and the same three mistakes appear almost universally.

Mistake 1: Manual trigger during incidents. The playbook exists, it works in staging, and it’s documented in Confluence. But it requires a human to wake up, authenticate, find the runbook, and have working vault credentials. That human latency adds 10–25 minutes to your MTTR. A failover mechanism that depends on a person being awake and functional at 3am is not a failover mechanism — it’s a recovery procedure. These are not the same thing.

Mistake 2: Creating the failover record during the incident. I stopped doing this after a particularly bad incident where we spent eight minutes troubleshooting why the new record wasn’t resolving. A name that did not exist a minute ago is the worst case for caching: any resolver that asked for it earlier may still hold a negative (NXDOMAIN) answer, and nothing about the record has ever been exercised end to end. The standby A record should exist permanently, pointing at your standby IP, proxied, at all times. You’re not hiding it. You’re just not routing production traffic to it yet. When failover happens, you’re changing the content of an existing record, not creating a new one. That distinction matters for recovery speed.

Mistake 3: Not pinning zone_id explicitly. The community.general.cloudflare_dns module does a zone name lookup on every run. Under normal conditions, that’s fine. During a major incident when you’re running multiple playbook tasks simultaneously, those extra calls count against Cloudflare’s global API rate limit of 1200 requests per 5 minutes, which is enforced per user rather than per token. Cloudflare documents that requests over the limit are rejected with HTTP 429 until the window resets, and a sudden wall of failed API calls in the middle of an incident gets misdiagnosed as a credentials problem every single time. Always pull zone_id from Ansible vault and pass it directly. Never resolve it at runtime.

Watch out: Cloudflare occasionally returns HTTP 524 (a timeout occurred) on DNS write operations during their own platform incidents. Always wrap your update tasks with retries: 3 and delay: 10, plus an explicit until condition, because older ansible-core releases ignore retries when until is absent. Without this, a transient Cloudflare hiccup during your incident becomes a compounded failure.
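
The retry policy itself is simple enough to sketch outside Ansible. The helper below is illustrative only: send stands for any zero-argument callable that performs the API call and returns the HTTP status code, and treating 429 and 524 as the transient set is my assumption, not something Cloudflare prescribes.

```python
import time

# Assumption: rate limiting (429) and edge timeouts (524) are worth retrying;
# anything else is returned to the caller immediately.
RETRYABLE_STATUS = {429, 524}


def call_with_retries(send, max_attempts=3, delay=10.0, sleep=time.sleep):
    """Call send() up to max_attempts times, pausing delay seconds between attempts.

    send is a hypothetical zero-argument callable returning an HTTP status code.
    Returns the last status code seen.
    """
    status = send()
    attempts = 1
    while status in RETRYABLE_STATUS and attempts < max_attempts:
        sleep(delay)
        status = send()
        attempts += 1
    return status


if __name__ == "__main__":
    responses = iter([524, 200])
    # sleep is injected so the demo does not actually wait.
    print(call_with_retries(lambda: next(responses), sleep=lambda seconds: None))  # 200
```

Injecting sleep keeps the helper testable; in a real run you would leave the default in place.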

The Correct Approach: Ansible Playbook Architecture

A production-safe failover playbook has three distinct plays, not one: pre-flight validation, the record swap, and post-flight verification. Skipping the third play means you have no confirmation the change actually took effect, so your playbook can report success while traffic is still being sent to the dead IP.

Store your credentials in Ansible vault at group_vars/all/vault.yml with keys vault_cf_api_token and vault_cf_zone_id. Use an API token scoped to Zone > DNS > Edit for the specific zone only — never the Global API Key. The Global API Key cannot be scoped, cannot be rotated without breaking everything else, and is a security antipattern that I refuse to use in any production system.

Here is the full three-play failover playbook. Read the inline comments — several of them encode hard-won lessons:

# dns_failover.yml
# Cloudflare DNS failover playbook — swaps A record to standby IP
# Requires: ansible-core, plus the collections installed in the previous section
# Vault vars: vault_cf_api_token, vault_cf_zone_id
# Usage: ansible-playbook dns_failover.yml -e "target_hostname=app.example.com failover_ip=203.0.113.50"

---
- name: Pre-flight — validate current DNS state
  hosts: localhost
  gather_facts: false
  vars:
    cf_api_token: "{{ vault_cf_api_token }}"
    cf_zone_id: "{{ vault_cf_zone_id }}"

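  tasks:
    # Read the record from the Cloudflare API, not from public DNS: a proxied
    # hostname resolves to Cloudflare edge addresses, never to the origin IP.
    - name: Fetch current A record for target hostname
      ansible.builtin.uri:
        url: "https://api.cloudflare.com/client/v4/zones/{{ cf_zone_id }}/dns_records?type=A&name={{ target_hostname }}"
        method: GET
        headers:
          Authorization: "Bearer {{ cf_api_token }}"
        status_code: 200
      register: current_record
      retries: 3
      delay: 10
      until: current_record is succeeded

    - name: Abort unless exactly one A record matches
      ansible.builtin.assert:
        that:
          - current_record.json.result | length == 1
        fail_msg: "Expected exactly one A record for {{ target_hostname }}."

    - name: Abort if already pointing to failover IP (idempotency guard)
      ansible.builtin.fail:
        msg: >
          Record already points to {{ failover_ip }}.
          Failover may have already run. Aborting to prevent duplicate API calls.
      when: current_record.json.result[0].content == failover_ip

    - name: Record pre-failover state to local facts file
      ansible.builtin.copy:
        content: "{{ pre_failover_state | to_nice_json }}"
        dest: /var/lib/ansible/dns_failover_state.json
        mode: '0600'      # Restrict: this file contains your original IP history
      vars:
        pre_failover_state:
          hostname: "{{ target_hostname }}"
          original_ip: "{{ current_record.json.result[0].content }}"
          failover_ip: "{{ failover_ip }}"
          # now() works with gather_facts: false; ansible_date_time does not
          timestamp: "{{ now(utc=true, fmt='%Y-%m-%dT%H:%M:%SZ') }}"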

- name: Execute DNS record swap
  hosts: localhost
  gather_facts: false
  vars:
    cf_api_token: "{{ vault_cf_api_token }}"
    cf_zone_id: "{{ vault_cf_zone_id }}"

  tasks:
    - name: Fetch current DNS record ID dynamically (never use static record_id)
      ansible.builtin.uri:
        url: "https://api.cloudflare.com/client/v4/zones/{{ cf_zone_id }}/dns_records?type=A&name={{ target_hostname }}"
        method: GET
        headers:
          Authorization: "Bearer {{ cf_api_token }}"
          Content-Type: "application/json"
        status_code: 200
      register: cf_record_lookup
      # Cloudflare record IDs change on delete+recreate, so look the ID up on every run
      retries: 3
      delay: 10
      until: cf_record_lookup is succeeded   # older ansible-core releases ignore retries without until

    - name: Extract record ID from API response
      ansible.builtin.set_fact:
        cf_record_id: "{{ cf_record_lookup.json.result[0].id }}"

    - name: Update A record to failover IP via Cloudflare API
      ansible.builtin.uri:
        url: "https://api.cloudflare.com/client/v4/zones/{{ cf_zone_id }}/dns_records/{{ cf_record_id }}"
        method: PUT
        headers:
          Authorization: "Bearer {{ cf_api_token }}"
          Content-Type: "application/json"
        body_format: json
        body:
          type: A
          name: "{{ target_hostname }}"
          content: "{{ failover_ip }}"
          ttl: 1          # Cloudflare auto TTL — effective for proxied records
          proxied: true   # CRITICAL: do NOT set to false during incidents — you lose DDoS protection
        status_code: 200
      register: cf_update_result
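      retries: 3
      delay: 10
      until: cf_update_result is succeeded
      # default(false) covers non-JSON error bodies such as an HTTP 524 page
      failed_when: not (cf_update_result.json.success | default(false))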

- name: Post-flight — verify DNS propagation
  hosts: localhost
  gather_facts: false

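  tasks:
    # Verify against the Cloudflare API rather than public DNS: with proxied: true,
    # resolvers return Cloudflare edge addresses, so failover_ip never shows up in a lookup.
    - name: Wait for the record to reflect failover IP (poll every 15s, max 2 min)
      ansible.builtin.uri:
        url: "https://api.cloudflare.com/client/v4/zones/{{ vault_cf_zone_id }}/dns_records?type=A&name={{ target_hostname }}"
        method: GET
        headers:
          Authorization: "Bearer {{ vault_cf_api_token }}"
        status_code: 200
      register: post_failover_record
      retries: 8
      delay: 15
      until: >-
        post_failover_record is succeeded and
        (post_failover_record.json.result | map(attribute='content') | list) == [failover_ip]
      # Note: do NOT use a notify handler here. Handlers run at end of play,
      # meaning your playbook would report success before the change is confirmed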

    - name: Confirm success and emit structured log
      ansible.builtin.debug:
        msg:
          event: dns_failover_complete
          hostname: "{{ target_hostname }}"
          new_ip: "{{ failover_ip }}"
          verified: true

Watch out: Never use a notify handler for the post-verification task. Handlers execute at the end of the play, which means your playbook would report success before DNS resolution is confirmed. Use a direct task with register and until as shown above.

Advanced Patterns: Event-Driven Failover and Rollback Safety

The playbook above is the mechanism. The architecture around it determines whether it’s actually automated or just “automatable.”

For event-driven triggering, I use Alertmanager webhook receivers in front of AWX job templates (AWX is the upstream project of Ansible Automation Platform’s controller). When Alertmanager fires an alert with severity=critical and service=origin, it sends a POST to a webhook receiver, which launches the failover job template with extra_vars containing target_hostname and failover_ip. No human in the loop. AWX’s built-in job template webhooks are designed for GitHub and GitLab payloads, so in practice a small receiver has to translate the Alertmanager payload into a call to the job template launch endpoint, and the template must be configured to accept extra variables at launch. The ansible-runner version matters here: use 2.3.x or higher. Older versions silently drop extra vars passed via webhook payload, which means your playbook runs with empty variables and fails in a confusing way. See the AWX webhook documentation for the full integration setup.
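
The translation step between Alertmanager and the job template can be sketched as a pure function. The top-level payload shape (an alerts list whose entries carry status and labels) follows Alertmanager’s webhook format; the hostname and failover_ip labels are hypothetical and would have to be attached by your own alert rules.

```python
def failover_vars_from_alerts(payload: dict) -> list:
    """Build one extra_vars dict per firing alert labelled severity=critical, service=origin."""
    runs = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing":
            continue  # resolved alerts must never trigger a failover
        if labels.get("severity") != "critical" or labels.get("service") != "origin":
            continue
        hostname = labels.get("hostname")        # hypothetical label
        failover_ip = labels.get("failover_ip")  # hypothetical label
        if hostname and failover_ip:
            runs.append({"target_hostname": hostname, "failover_ip": failover_ip})
    return runs


if __name__ == "__main__":
    payload = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {
                    "severity": "critical",
                    "service": "origin",
                    "hostname": "app.example.com",
                    "failover_ip": "203.0.113.50",
                },
            }
        ],
    }
    print(failover_vars_from_alerts(payload))
```

Alerts that lack either label are skipped rather than launched with empty variables, which is exactly the confusing failure mode described above.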

For rollback, the key principle is: never hardcode the original IP. You wrote that playbook six months ago. Your origin IP has changed twice since then. The rollback playbook must read the pre-failover state from the facts file written during the forward failover run. Here’s the rollback playbook:

# dns_failover_rollback.yml
# Reads pre-failover state from facts file and restores original record
# Run after incident resolution — do NOT hardcode original IP

---
- name: Rollback DNS to pre-failover state
  hosts: localhost
  gather_facts: false
  vars:
    cf_api_token: "{{ vault_cf_api_token }}"
    cf_zone_id: "{{ vault_cf_zone_id }}"
    state_file: /var/lib/ansible/dns_failover_state.json  # Must be on persistent storage, not tmpfs

  tasks:
    - name: Load pre-failover state from facts file
      ansible.builtin.slurp:
        src: "{{ state_file }}"
      register: failover_state_raw

    - name: Parse state JSON
      ansible.builtin.set_fact:
        failover_state: "{{ failover_state_raw.content | b64decode | from_json }}"

    - name: Confirm rollback target with operator (interactive gate)
      ansible.builtin.pause:
        prompt: >
          Rolling back {{ failover_state.hostname }}
          from {{ failover_state.failover_ip }}
          to {{ failover_state.original_ip }}.
          Press ENTER to confirm or Ctrl+C to abort.

    - name: Fetch current record ID for rollback target
      ansible.builtin.uri:
        url: "https://api.cloudflare.com/client/v4/zones/{{ cf_zone_id }}/dns_records?type=A&name={{ failover_state.hostname }}"
        method: GET
        headers:
          Authorization: "Bearer {{ cf_api_token }}"
        status_code: 200
      register: rollback_record_lookup

    - name: Restore original A record
      ansible.builtin.uri:
        url: "https://api.cloudflare.com/client/v4/zones/{{ cf_zone_id }}/dns_records/{{ rollback_record_lookup.json.result[0].id }}"
        method: PUT
        headers:
          Authorization: "Bearer {{ cf_api_token }}"
          Content-Type: "application/json"
        body_format: json
        body:
          type: A
          name: "{{ failover_state.hostname }}"
          content: "{{ failover_state.original_ip }}"
          ttl: 1
          proxied: true
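        status_code: 200
      register: rollback_result
      retries: 3
      delay: 10
      until: rollback_result is succeeded
      # Check the PUT response itself, not the earlier record lookup
      failed_when: not (rollback_result.json.success | default(false))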

    - name: Archive state file post-rollback (do not delete, keep for audit trail)
      ansible.builtin.command:
        # now() is used because ansible_date_time is undefined with gather_facts: false
        cmd: "mv {{ state_file }} {{ state_file }}.{{ now(utc=true, fmt='%Y%m%dT%H%M%SZ') }}.bak"
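
Before trusting a state file for rollback, it is worth checking that it parses and that both addresses are real IPv4 values. A minimal sketch, assuming the four keys written by the forward playbook; the function name is my own.

```python
import ipaddress
import json

REQUIRED_KEYS = ("hostname", "original_ip", "failover_ip", "timestamp")


def load_rollback_target(raw: str) -> dict:
    """Parse the state file contents and refuse anything that cannot be rolled back safely."""
    state = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in state]
    if missing:
        raise ValueError(f"state file is missing keys: {missing}")
    # IPv4Address raises a ValueError subclass on anything that is not a valid address.
    original = ipaddress.IPv4Address(state["original_ip"])
    failover = ipaddress.IPv4Address(state["failover_ip"])
    if original == failover:
        raise ValueError("original_ip equals failover_ip; nothing to roll back")
    return state


if __name__ == "__main__":
    raw = json.dumps({
        "hostname": "app.example.com",
        "original_ip": "198.51.100.10",
        "failover_ip": "203.0.113.50",
        "timestamp": "2024-01-01T02:00:00Z",
    })
    print(load_rollback_target(raw)["original_ip"])  # 198.51.100.10
```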

For multi-origin setups, extend the state storage to an S3 bucket using amazon.aws.s3_object. Write and read dns_failover_state.json atomically. This survives the Ansible control node crashing mid-incident, which happens more often than you’d think, because the control node is often under load during the same event that triggered failover.
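
For the local copy of the state file, “atomically” usually means write to a temporary file and rename it into place. The sketch below shows that pattern with the standard library only; it is not the S3 variant, and the function name is illustrative.

```python
import json
import os
import tempfile


def write_state_atomically(path: str, state: dict) -> None:
    """Write state as JSON so readers see either the old file or the new one, never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with owner-only permissions in the target directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".dns_state.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "dns_failover_state.json")
        write_state_atomically(target, {"hostname": "app.example.com"})
        with open(target, encoding="utf-8") as handle:
            print(json.load(handle))
```

A crash between the write and the rename leaves the previous state file untouched, which is the property you want when the control node is struggling.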

Performance and Operational Cost Notes

Let’s talk numbers, because SLOs need real data behind them.

The Cloudflare API p99 response time for a DNS record update sits around 180–400ms. A complete three-play playbook run — pre-flight, swap, and post-flight verification — completes in 8–15 seconds under normal conditions. That is your minimum achievable RTO using this method. Plan your SLOs accordingly. If you need sub-5-second failover, you need Cloudflare’s Load Balancer or a BGP-based solution, not scripted DNS.

On TTL strategy: proxied records always report ttl: 1 (Auto), so there is nothing to tune there. For any DNS-only records in the failover path, set a low TTL permanently, not just during incidents. I’ve seen teams try to reactively lower TTL when they detect problems, and that does absolutely nothing for resolvers that already cached the old value at a higher TTL hours earlier. The time to set a low TTL is before you need it, not during the incident.

Cloudflare’s rate limit is 1200 API requests per 5 minutes per user, shared across all of that user’s tokens. A single failover playbook run uses approximately 4–6 API calls (GET zone lookup if you don’t pin zone_id, GET record, PUT update, optional verification calls). That leaves headroom for roughly 200 failover runs inside one five-minute window before hitting limits, which is more than enough for any reasonable fleet size.
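
The headroom figure is plain integer division, shown here so you can plug in your own call counts. The constants come from the numbers quoted above; the function name is illustrative.

```python
RATE_LIMIT_REQUESTS = 1200  # requests allowed per window, as quoted above
WINDOW_SECONDS = 300        # five minutes


def max_runs_per_window(calls_per_run: int, limit: int = RATE_LIMIT_REQUESTS) -> int:
    """How many complete playbook runs fit inside one rate-limit window."""
    if calls_per_run <= 0:
        raise ValueError("calls_per_run must be positive")
    return limit // calls_per_run


if __name__ == "__main__":
    print(max_runs_per_window(6))  # 200 (pessimistic case, 6 calls per run)
    print(max_runs_per_window(4))  # 300 (zone_id pinned, single verification call)
```

Keep in mind that every retry and every extra verification poll counts as a call, so a run during a Cloudflare incident costs more than a clean one.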

Cost comparison: Cloudflare Load Balancing is priced from $5/month, with usage-based charges on top; check the current pricing page for the exact terms that apply to your plan. For teams with fewer than five critical hostnames and an acceptable RTO of 2–5 minutes, the Ansible API approach is a legitimate cost-optimized architecture, not a workaround. It’s a deliberate trade-off between automation latency and infrastructure spend. Document that trade-off explicitly in your runbook so the next engineer doesn’t “fix” it by adding Load Balancer without understanding why it wasn’t there. You can find the full Cloudflare DNS API reference at the official Cloudflare developer documentation.

One final security note: the Ansible control node running this playbook has DNS-edit access to your entire zone. Treat it as a privileged host. Restrict SSH access, enable auditd, and rotate the API token every 90 days. The token scope is Zone > DNS > Edit for the specific zone only — scope it as narrowly as Cloudflare allows. For more patterns on securing your automation infrastructure, see the related posts at kuryzhev.cloud.
