PROBLEM: A 24/7 monitoring system can't afford an LLM call in every hot path. At 1 incident every 10 minutes × $0.50/incident, that's $72/day on a system that lives indefinitely.
WHY IT MATTERS: Most production events are deterministic — a stuck DAG, a missing artifact, a stale handoff. Reserve LLM cost for the events that genuinely need reasoning. Cost-tier the watchdog so the cheap layers fail-fast and the expensive layer is the exception, not the default.
STACK: Python (analytics/health_check.py — 61-probe catalog; analytics/health_fix.py — recipe runner), YAML (recipe_config.yaml keyed by probe_id), Anthropic Claude routine for L3 webhook escalation, Telegram Bot API for L4 human tap, cron + plist for L5 Friday digest

Airflow 5-Tier Watchdog — LLM Cost Optimization

12 deliverables (D1–D12), all merged 2026-05-01. 61-probe failure-mode catalog. Diane is at the END of the chain, not the beginning.
If a system runs 24/7, you can't afford an LLM in every hot path. Most events are deterministic. Reserve LLM cost for the ones that aren't.

The 5-tier escalation

Each tier handles what it can, then escalates only what it can't. By the time an event reaches L3 (the LLM tier), L1 and L2 have already filtered out 90%+ of incidents.

L1 · Probes (24/7) · $0 / incident
Deterministic Python checks running every 10 min — DAG state, file freshness, schema integrity, endpoint health.
61-probe catalog. If a probe fires, log to ops.db and try L2. If a probe stays green, do nothing.
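An L1 probe can be sketched as a pure function of observable state, with no LLM and no side effects. This is a minimal illustration, not the actual health_check.py code; the `ProbeResult` shape and the threshold default are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ProbeResult:
    probe_id: str
    fired: bool
    detail: str

def dag_stuck_check(started_at: datetime, now: datetime,
                    threshold_min: int = 30) -> ProbeResult:
    """L1 probe: fire if a DAG run has been 'running' past its threshold.
    Deterministic and side-effect free -- $0 per check."""
    running_min = (now - started_at).total_seconds() / 60
    return ProbeResult(
        probe_id="dag_stuck_check",
        fired=running_min > threshold_min,
        detail=f"running {running_min:.0f} min (threshold {threshold_min})",
    )
```

Because the probe is a pure function, it can be unit-tested against the full 61-probe catalog without touching Airflow.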
↓ probe failed
L2 · Recipes (deterministic repair) · $0 / incident
Predefined repair scripts keyed by probe_id. airflow dags backfill, git fsck, restart worker, repoint symlink.
Recipe table in YAML. If recipe succeeds, mark resolved + log. If recipe fails or there's no recipe, escalate to L3.
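The recipe lookup reduces to a table keyed by probe_id plus a shell runner. A minimal sketch, assuming the shape of the parsed recipe_config.yaml (the `RECIPES` dict and the `run_recipe` return values are illustrative, not the real health_fix.py API):

```python
import subprocess

# Hypothetical recipe table, as it might look after parsing recipe_config.yaml.
RECIPES = {
    "dag_stuck_check": "airflow dags backfill --reset-dagruns plan_part1",
    "artifact_missing": "git fsck --full",
}

def run_recipe(probe_id: str, runner=subprocess.run) -> str:
    """L2: deterministic repair. Returns 'resolved', 'failed', or 'no_recipe'.
    'failed' and 'no_recipe' both escalate to L3."""
    cmd = RECIPES.get(probe_id)
    if cmd is None:
        return "no_recipe"
    result = runner(cmd.split(), capture_output=True)
    return "resolved" if result.returncode == 0 else "failed"
```

Injecting `runner` keeps the layer testable without actually shelling out.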
↓ recipe failed or missing
L3 · Claude Routine Webhook · $0.05–$0.20 / incident
Anthropic-hosted Claude routine airflow-investigator-bridge reads last 50 log lines + recipe attempts, proposes a patch.
This is the only LLM in the loop. Bounded prompt size (5K tokens in, 2K out). If the routine proposes a code patch, it queues for L4 human approval.
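What makes the L3 cost predictable is the hard bound on prompt size. A sketch of the payload assembly (the section headings and the 4-chars-per-token heuristic are assumptions; the 50-line and 5K-token limits are from the tier contract above):

```python
MAX_INPUT_TOKENS = 5_000   # bound from the L3 contract
CHARS_PER_TOKEN = 4        # rough heuristic, an assumption

def build_l3_payload(log_lines: list[str], recipe_attempts: list[str],
                     dag_source: str) -> str:
    """Assemble the bounded prompt for the L3 webhook: last 50 log lines +
    recipe attempts + DAG file, hard-truncated so cost stays predictable."""
    body = "\n".join([
        "## Last log lines", *log_lines[-50:],
        "## Recipe attempts", *recipe_attempts,
        "## DAG source", dag_source,
    ])
    return body[: MAX_INPUT_TOKENS * CHARS_PER_TOKEN]
```

Truncation happens before the API call, so a pathological log can never blow the per-incident budget.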
↓ patch proposed or unresolved
L4 · Telegram Tap (human approval) · human time only
Tap-to-approve Telegram message: "L3 proposes patch X for probe Y. Approve / Reject."
Human reads patch summary on phone, taps. If approved, the patch runs through merge_branches.sh + verify gate. If rejected, the incident is marked unresolved and surfaces in L5 digest.
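The tap-to-approve message maps directly onto the Telegram Bot API's `sendMessage` method with an inline keyboard; the `callback_data` comes back in a `callback_query` when the human taps. A minimal sketch (the message wording and `approve:`/`reject:` callback scheme are assumptions):

```python
import json

def approval_message(chat_id: int, probe_id: str, summary: str) -> dict:
    """Build the Telegram sendMessage payload for the L4 approval tap."""
    return {
        "chat_id": chat_id,
        "text": f"L3 proposes patch for {probe_id}: {summary}. Approve / Reject.",
        "reply_markup": json.dumps({
            "inline_keyboard": [[
                {"text": "✅ Approve", "callback_data": f"approve:{probe_id}"},
                {"text": "❌ Reject",  "callback_data": f"reject:{probe_id}"},
            ]]
        }),
    }
```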
↓ if no resolution after 24h
L5 · Friday Digest (batched) · $0 (cron) + 15 min/wk read
Weekly summary of unresolved incidents, cost breakdown, recipe hit-rate, top failure patterns.
One email per Friday morning. Inputs to next sprint's prioritization (which probes need new recipes, which recipes need updating).
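The digest itself is a cheap batch aggregation over the week's incident log. A sketch under an assumed incident schema (the `resolved_at_tier` / `llm_cost` field names are illustrative, not the real ops.db columns):

```python
from collections import Counter

def friday_digest(incidents: list[dict]) -> dict:
    """L5: batch the week's incidents into one summary. Each incident dict
    is assumed to carry 'probe_id', 'resolved_at_tier' (1-5 or None for
    unresolved), and 'llm_cost'."""
    return {
        "unresolved": [i["probe_id"] for i in incidents
                       if i["resolved_at_tier"] is None],
        "recipe_hit_rate": sum(1 for i in incidents
                               if i["resolved_at_tier"] == 2) / max(len(incidents), 1),
        "llm_cost": round(sum(i["llm_cost"] for i in incidents), 2),
        "top_failures": Counter(i["probe_id"] for i in incidents).most_common(3),
    }
```

The recipe hit-rate and top-failure counts are exactly the inputs the next sprint needs to decide which probes get new recipes.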

The escalation in practice — example

L1 · Probe dag_stuck_check fires: plan_part1 has been "running" for 47 min (threshold 30).
L2 · Recipe lookup for dag_stuck_check → run airflow dags backfill --reset-dagruns plan_part1. Recipe runs, but task still stuck after 5 min.
L3 · Webhook fires to Anthropic Claude routine. Payload: last 50 log lines + recipe attempt + DAG file. Claude responds: "Worker is holding stale Python imports — need to kill -HUP the celery worker." Generates patch.
L4 · Telegram message: "L3 proposes worker restart for plan_part1 stall. Tap ✅ to apply." Diane taps from phone in 30 sec.
DONE · Patch applied, worker restarted, DAG resumes. Total cost: $0.07 (one Claude call). Total Diane time: 30 sec.
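The walkthrough above reduces to one driver function: each tier is a callable, and control only falls through to the next tier on failure. A sketch, with the tiers injected as callables so the cheap layers can be exercised without the expensive ones (the function names and return strings are illustrative):

```python
def escalate(probe_fired: bool, run_recipe, run_llm, request_approval) -> str:
    """Drive one incident through the tiers; stop at the cheapest tier
    that resolves it."""
    if not probe_fired:
        return "green"                     # L1: probe stayed green, do nothing
    if run_recipe() == "resolved":
        return "resolved@L2"               # deterministic repair, $0
    patch = run_llm()                      # L3: the only LLM call in the loop
    if patch and request_approval(patch):  # L4: human tap on the phone
        return "resolved@L4"
    return "digest@L5"                     # batched into the Friday summary
```

The structure makes the cost property visible in the control flow: `run_llm` is unreachable unless the $0 tiers have already failed.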

Why this matters as a Generative AI engineering pattern

The default architecture for "LLMs in production" is to put the LLM in the hot path — every event flows through it. That works at MVP scale and breaks at production scale: the LLM becomes the most expensive, slowest, least reliable layer in the system.

The fix is to tier the system by determinism. Layers that can be deterministic should be — they're cheaper, faster, and more reliable. Reserve the LLM tier for events where the cost of human reasoning would otherwise be required, and bound that tier (5K tokens in, 2K out, single response) so cost is predictable.

This is the same pattern that makes good caching architectures work — except the "cache" here is determinism itself. L1 and L2 are the cache; L3 is the cache miss. The goal is a 90%+ hit rate at L1+L2 so L3 is rare and L4 is rarer still.

Cost math

| Tier | Frequency (estimated) | Per-incident cost | Cost / week |
| --- | --- | --- | --- |
| L1 probes | 1 firing / hour avg | $0 | $0 |
| L2 recipes | ~80% of L1 firings → resolved here | $0 | $0 |
| L3 Claude routine | ~5–10 incidents/week (the L2 misses) | $0.05–$0.20 | ~$0.50–$2.00 |
| L4 Telegram tap | ~1–3 / week | 30 sec human time | ~5 min |
| L5 weekly digest | 1 / week | 15 min human time | 15 min |
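The headline comparison follows directly from the numbers in PROBLEM and the table above, as a quick back-of-envelope check:

```python
def hot_path_week(incidents_per_day: float, cost_per_incident: float) -> float:
    """Weekly cost if every incident hit the LLM (the naive hot-path design)."""
    return incidents_per_day * cost_per_incident * 7

def tiered_week(l3_calls: int, cost_per_call: float) -> float:
    """Weekly cost when only the L2 misses reach the LLM."""
    return l3_calls * cost_per_call

baseline = hot_path_week(24 * 6, 0.50)  # 1 incident / 10 min, as in PROBLEM
tiered = tiered_week(10, 0.20)          # worst-case week from the table
```

$504/week for the hot-path design versus at most $2/week tiered, a ~250x reduction purely from routing determinism ahead of the LLM.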
Result: 12 deliverables (D1–D12) shipped + merged 2026-05-01.
61-probe failure-mode catalog covers all known production failure modes.
LLM cost capped at single-digit dollars per week; LLM is invoked only when deterministic tiers have already failed.
Diane sits at L4, not L1 — escalation flow ends with her, doesn't start with her.