PROBLEM: A 24/7 monitoring system can't afford an LLM call in every hot path. At 1 incident every 10 minutes × $0.50/incident, that's $72/day on a system that lives indefinitely.
WHY IT MATTERS: Most production events are deterministic — a stuck DAG, a missing artifact, a stale handoff. Reserve LLM cost for the events that genuinely need reasoning. Cost-tier the watchdog so the cheap layers fail-fast and the expensive layer is the exception, not the default.
STACK: Python (analytics/health_check.py — 61-probe catalog; analytics/health_fix.py — recipe runner), YAML (recipe_config.yaml keyed by probe_id), Anthropic Claude routine for L3 webhook escalation, Telegram Bot API for L4 human tap, cron + plist for L5 Friday digest

Airflow 5-Tier Watchdog — LLM Cost Optimization

12 deliverables (D1–D12), all merged 2026-05-01. 61-probe failure-mode catalog. Diane is at the END of the chain, not the beginning.
If a system runs 24/7, you can't afford an LLM in every hot path. Most events are deterministic. Reserve LLM cost for the ones that aren't.

The 5-tier escalation

Each tier handles what it can, then escalates only what it can't. By the time an event reaches L3 (the LLM tier), L1 and L2 have already filtered out 90%+ of incidents.

L1 · Probes (24/7) · $0 / incident
Deterministic Python checks running every 10 min — DAG state, file freshness, schema integrity, endpoint health.
61-probe catalog. If a probe fires, log to ops.db and try L2. If a probe stays green, do nothing.
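An L1 probe can be sketched as a pure function of observable state, with no LLM and no side effects. This is a minimal illustration, not the actual health_check.py code; the `ProbeResult` shape and the threshold default are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ProbeResult:
    probe_id: str
    fired: bool
    detail: str

def dag_stuck_check(started_at: datetime, now: datetime,
                    threshold_min: int = 30) -> ProbeResult:
    """L1 probe: fire if a DAG run has been 'running' past its threshold.
    Deterministic and side-effect free -- $0 per check."""
    running_min = (now - started_at).total_seconds() / 60
    return ProbeResult(
        probe_id="dag_stuck_check",
        fired=running_min > threshold_min,
        detail=f"running {running_min:.0f} min (threshold {threshold_min})",
    )
```

Because the probe is a pure function, it can be unit-tested against the full 61-probe catalog without touching Airflow.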
↓ probe failed
L2 · Recipes (deterministic repair) · $0 / incident
Predefined repair scripts keyed by probe_id. airflow dags backfill, git fsck, restart worker, repoint symlink.
Recipe table in YAML. If recipe succeeds, mark resolved + log. If recipe fails or there's no recipe, escalate to L3.
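The recipe lookup reduces to a table keyed by probe_id plus a shell runner. A minimal sketch, assuming the shape of the parsed recipe_config.yaml (the `RECIPES` dict and the `run_recipe` return values are illustrative, not the real health_fix.py API):

```python
import subprocess

# Hypothetical recipe table, as it might look after parsing recipe_config.yaml.
RECIPES = {
    "dag_stuck_check": "airflow dags backfill --reset-dagruns plan_part1",
    "artifact_missing": "git fsck --full",
}

def run_recipe(probe_id: str, runner=subprocess.run) -> str:
    """L2: deterministic repair. Returns 'resolved', 'failed', or 'no_recipe'.
    'failed' and 'no_recipe' both escalate to L3."""
    cmd = RECIPES.get(probe_id)
    if cmd is None:
        return "no_recipe"
    result = runner(cmd.split(), capture_output=True)
    return "resolved" if result.returncode == 0 else "failed"
```

Injecting `runner` keeps the layer testable without actually shelling out.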
↓ recipe failed or missing
L3 · Claude Routine Webhook · $0.05–$0.20 / incident
Anthropic-hosted Claude routine airflow-investigator-bridge reads last 50 log lines + recipe attempts, proposes a patch.
This is the only LLM in the loop. Bounded prompt size (5K tokens in, 2K out). If the routine proposes a code patch, it queues for L4 human approval.
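What makes the L3 cost predictable is the hard bound on prompt size. A sketch of the payload assembly (the section headings and the 4-chars-per-token heuristic are assumptions; the 50-line and 5K-token limits are from the tier contract above):

```python
MAX_INPUT_TOKENS = 5_000   # bound from the L3 contract
CHARS_PER_TOKEN = 4        # rough heuristic, an assumption

def build_l3_payload(log_lines: list[str], recipe_attempts: list[str],
                     dag_source: str) -> str:
    """Assemble the bounded prompt for the L3 webhook: last 50 log lines +
    recipe attempts + DAG file, hard-truncated so cost stays predictable."""
    body = "\n".join([
        "## Last log lines", *log_lines[-50:],
        "## Recipe attempts", *recipe_attempts,
        "## DAG source", dag_source,
    ])
    return body[: MAX_INPUT_TOKENS * CHARS_PER_TOKEN]
```

Truncation happens before the API call, so a pathological log can never blow the per-incident budget.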
↓ patch proposed or unresolved
L4 · Telegram Tap (human approval) · human time only
Tap-to-approve Telegram message: "L3 proposes patch X for probe Y. Approve / Reject."
Human reads patch summary on phone, taps. If approved, the patch runs through merge_branches.sh + verify gate. If rejected, the incident is marked unresolved and surfaces in L5 digest.
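The tap-to-approve message maps directly onto the Telegram Bot API's `sendMessage` method with an inline keyboard; the `callback_data` comes back in a `callback_query` when the human taps. A minimal sketch (the message wording and `approve:`/`reject:` callback scheme are assumptions):

```python
import json

def approval_message(chat_id: int, probe_id: str, summary: str) -> dict:
    """Build the Telegram sendMessage payload for the L4 approval tap."""
    return {
        "chat_id": chat_id,
        "text": f"L3 proposes patch for {probe_id}: {summary}. Approve / Reject.",
        "reply_markup": json.dumps({
            "inline_keyboard": [[
                {"text": "✅ Approve", "callback_data": f"approve:{probe_id}"},
                {"text": "❌ Reject",  "callback_data": f"reject:{probe_id}"},
            ]]
        }),
    }
```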
↓ if no resolution after 24h
L5 · Friday Digest (batched) · $0 (cron) + 15 min/wk read
Weekly summary of unresolved incidents, cost breakdown, recipe hit-rate, top failure patterns.
One email per Friday morning. Inputs to next sprint's prioritization (which probes need new recipes, which recipes need updating).
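The digest itself is a cheap batch aggregation over the week's incident log. A sketch under an assumed incident schema (the `resolved_at_tier` / `llm_cost` field names are illustrative, not the real ops.db columns):

```python
from collections import Counter

def friday_digest(incidents: list[dict]) -> dict:
    """L5: batch the week's incidents into one summary. Each incident dict
    is assumed to carry 'probe_id', 'resolved_at_tier' (1-5 or None for
    unresolved), and 'llm_cost'."""
    return {
        "unresolved": [i["probe_id"] for i in incidents
                       if i["resolved_at_tier"] is None],
        "recipe_hit_rate": sum(1 for i in incidents
                               if i["resolved_at_tier"] == 2) / max(len(incidents), 1),
        "llm_cost": round(sum(i["llm_cost"] for i in incidents), 2),
        "top_failures": Counter(i["probe_id"] for i in incidents).most_common(3),
    }
```

The recipe hit-rate and top-failure counts are exactly the inputs the next sprint needs to decide which probes get new recipes.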

The escalation in practice — example

L1 · Probe dag_stuck_check fires: plan_part1 has been "running" for 47 min (threshold 30).
L2 · Recipe lookup for dag_stuck_check → run airflow dags backfill --reset-dagruns plan_part1. Recipe runs, but task still stuck after 5 min.
L3 · Webhook fires to Anthropic Claude routine. Payload: last 50 log lines + recipe attempt + DAG file. Claude responds: "Worker is holding stale Python imports — need to kill -HUP the celery worker." Generates patch.
L4 · Telegram message: "L3 proposes worker restart for plan_part1 stall. Tap ✅ to apply." Diane taps from phone in 30 sec.
DONE · Patch applied, worker restarted, DAG resumes. Total cost: $0.07 (one Claude call). Total Diane time: 30 sec.
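The walkthrough above reduces to one driver function: each tier is a callable, and control only falls through to the next tier on failure. A sketch, with the tiers injected as callables so the cheap layers can be exercised without the expensive ones (the function names and return strings are illustrative):

```python
def escalate(probe_fired: bool, run_recipe, run_llm, request_approval) -> str:
    """Drive one incident through the tiers; stop at the cheapest tier
    that resolves it."""
    if not probe_fired:
        return "green"                     # L1: probe stayed green, do nothing
    if run_recipe() == "resolved":
        return "resolved@L2"               # deterministic repair, $0
    patch = run_llm()                      # L3: the only LLM call in the loop
    if patch and request_approval(patch):  # L4: human tap on the phone
        return "resolved@L4"
    return "digest@L5"                     # batched into the Friday summary
```

The structure makes the cost property visible in the control flow: `run_llm` is unreachable unless the $0 tiers have already failed.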

Why this matters as a Generative AI engineering pattern

The default architecture for "LLMs in production" is to put the LLM in the hot path — every event flows through it. That works at MVP scale and breaks at production scale: the LLM becomes the most expensive, slowest, least reliable layer in the system.

The fix is to tier the system by determinism. Layers that can be deterministic should be — they're cheaper, faster, and more reliable. Reserve the LLM tier for events where the cost of human reasoning would otherwise be required, and bound that tier (5K tokens in, 2K out, single response) so cost is predictable.

This is the same pattern that makes good caching architectures work — except the "cache" here is determinism itself. L1 and L2 are the cache; L3 is the cache miss. The goal is a 90%+ hit rate at L1+L2 so L3 is rare and L4 is rarer still.

Cost math

| Tier | Frequency (estimated) | Per-incident cost | Cost / week |
| --- | --- | --- | --- |
| L1 probes | 1 firing / hour avg | $0 | $0 |
| L2 recipes | ~80% of L1 firings → resolved here | $0 | $0 |
| L3 Claude routine | ~5–10 incidents/week (the L2 misses) | $0.05–$0.20 | ~$0.50–$2.00 |
| L4 Telegram tap | ~1–3 / week | 30 sec human time | ~5 min |
| L5 weekly digest | 1 / week | 15 min human time | 15 min |
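The headline comparison follows directly from the numbers in PROBLEM and the table above, as a quick back-of-envelope check:

```python
def hot_path_week(incidents_per_day: float, cost_per_incident: float) -> float:
    """Weekly cost if every incident hit the LLM (the naive hot-path design)."""
    return incidents_per_day * cost_per_incident * 7

def tiered_week(l3_calls: int, cost_per_call: float) -> float:
    """Weekly cost when only the L2 misses reach the LLM."""
    return l3_calls * cost_per_call

baseline = hot_path_week(24 * 6, 0.50)  # 1 incident / 10 min, as in PROBLEM
tiered = tiered_week(10, 0.20)          # worst-case week from the table
```

$504/week for the hot-path design versus at most $2/week tiered, a ~250x reduction purely from routing determinism ahead of the LLM.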
Result: 12 deliverables (D1–D12) shipped + merged 2026-05-01.
61-probe failure-mode catalog covers all known production failure modes.
LLM cost capped at single-digit dollars per week; LLM is invoked only when deterministic tiers have already failed.
Diane sits at L4, not L1 — escalation flow ends with her, doesn't start with her.