01 · LLM Cost / Context Engineering
Thread Routing & Context Optimization
Problem: A multi-agent system produced hundreds of unstructured messages a day across 8 VP agents — review time became the bottleneck.
Why it matters: Without classification + routing, every message is a context-switch tax. With it, only the right thread reaches the right person, and per-session token cost becomes attributable.
What I built:
- 9-class message classifier with routing rules
- Post-classification flow dashboard (system map of where threads go)
- Before/after token-cost timeline per VP per sprint phase
- Local-extraction script that compresses prior-sprint state into the next session's context
Result: Per-session token cost made visible and attributable per VP, per sprint phase. Routing replaces inbox-style triage — only the relevant thread reaches the right person.
Stack: Python (classifier), SQLite (forum.db with FTS5), Flask (localhost:5556 API), HTML/CSS/JS
02 · Multi-Agent Orchestration
Wave Run v2 — Prose Skill → Airflow DAG
Problem: A 10-phase prose skill for running parallel LLM agents produced 7 distinct, repeated failure patterns. Cost: 4–6 hours of merge cleanup per sprint.
Why it matters: Prose instructions are not enforcement. An LLM with all the orchestration info in context will improvise under pressure. The fix isn't clearer prose — it's removing the LLM from the orchestration decision entirely.
What I built:
- Codified the 10-phase skill into a wave_run_v2 Airflow DAG
- Mechanical phase gates: entrypoint contract → manifest load → forum bind → preflight lint with autofix → ownership check → typed approval → worktree gate → dispatch → freshness-validated handoff → output validation
- Coordinator (DAG) has no git access; orchestrator (separate session) does — separation is mechanical, not by convention
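The gate sequence above can be sketched as a fail-fast chain. This is plain Python standing in for the Airflow task graph; the two gate bodies are stubs (the state keys and failure messages are invented), but the shape is the point: any gate that raises stops the run before dispatch.

```python
# Minimal fail-fast gate chain, standing in for the Airflow task graph.
# Gate names mirror two of the ten phases; bodies are illustrative stubs.
class GateFailure(Exception):
    pass

def preflight_lint(state: dict) -> dict:
    # Autofix what we can; hard-fail on what we can't.
    state.setdefault("lint_clean", True)
    if not state["lint_clean"]:
        raise GateFailure("lint")
    return state

def typed_approval(state: dict) -> dict:
    # A typed token, not a yes/no prompt: the approver must echo the run id.
    if state.get("approval") != state.get("run_id"):
        raise GateFailure("approval")
    return state

GATES = [preflight_lint, typed_approval]  # the real DAG chains all ten

def run(state: dict) -> dict:
    for gate in GATES:
        state = gate(state)  # any raise stops the run before dispatch
    return state
```

Because each gate is a task, not a prompt instruction, the LLM never gets the chance to improvise past a failed check.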
Result: 7 documented failure patterns from the prose-skill version eliminated mechanically. 4–6 Diane-hours of merge cleanup per sprint reclaimed.
Stack: Airflow 3.x, Python, Celery, cmux, Postgres + Redis, manifest.yaml + manifest.db
03 · LLM Cost Optimization
Airflow 5-Tier Watchdog
Problem: A 24/7 monitoring system can't afford an LLM call in every hot path. At 1 incident per 10 min × $0.50/incident, that's $72/day on a system that lives indefinitely.
Why it matters: Most production events are deterministic. Reserve LLM cost for events that genuinely need reasoning. Tier the watchdog so the cheap layers fail-fast and the expensive layer is the exception, not the default.
What I built:
- L1 probes — 24/7 deterministic checks ($0/incident, 61-probe catalog)
- L2 recipes — predefined repair scripts ($0/incident, YAML-keyed)
- L3 webhook — Claude routine for the L2 misses ($0.05–0.20/incident)
- L4 Telegram tap — human approval for proposed patches
- L5 weekly digest — batched unresolved incidents
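The tier ladder can be sketched as an ordered walk. Probe and recipe names and the per-tier costs here are stand-ins; the real system has a 61-probe catalog and YAML-keyed recipes. The point is the ordering: deterministic layers answer first, and the LLM tier is reached only when everything cheaper has passed.

```python
# Illustrative tier ladder. Incident kinds, probe logic, and costs are
# invented; the structure (cheap-first, LLM-last, human at the end) is real.
COSTS = {"L1": 0.0, "L2": 0.0, "L3": 0.12}  # $/incident; L3 is a midpoint

def l1_probe(incident: dict) -> bool:
    return incident.get("kind") == "disk_full"           # deterministic check

def l2_recipe(incident: dict) -> bool:
    return incident.get("kind") in {"stale_lock", "orphan_worker"}

def l3_llm(incident: dict) -> bool:
    return True  # placeholder for the Claude routine; always "handles" here

def handle(incident: dict) -> tuple[str, float]:
    """Walk the tiers; return (tier that resolved it, cost incurred)."""
    for tier, fn in [("L1", l1_probe), ("L2", l2_recipe), ("L3", l3_llm)]:
        if fn(incident):
            return tier, COSTS[tier]
    return "L4", 0.0  # human approval; unresolved items batch into the L5 digest
```

Since most incidents terminate at L1 or L2, expected cost per incident collapses toward zero even though the worst case still has full reasoning available.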
Result: 12 deliverables (D1–D12) shipped + merged 2026-05-01. LLM cost capped at single-digit dollars per week. Diane sits at L4, not L1 — escalation flow ends with the human, doesn't start with them.
Stack: Python (health_check.py, health_fix.py), YAML config, Anthropic Claude routine, Telegram Bot API, cron + plist
04 · On-Device ML
Phoneme Classifier — 97% accuracy on-device
Problem: Speech recognition for 4-year-olds, on-device, on a phone, without sending any audio of children to the cloud.
Why it matters: Cloud calls would mean COPPA compliance overhead, latency, and ongoing inference cost. Most speech models are MB-to-GB. The constraint: fit accurate phoneme recognition in tens of KB that runs on a low-end Android in <100ms.
What I built:
- Two-track classifier — Track A (MFCC + cosine, 15.4KB asset), Track B (WavLM fine-tune, ONNX export)
- VTLN factor 1.104 derived from child F0 mean (269 Hz) — adult-trained features warped to child vocal tract
- Sander 1972 substitution table for age-gated developmental allowances
- 4-gate noise rejector for bedroom-recording realism
- Confusion matrix audit, 120 unit tests across 9 files
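Track A's decision rule can be sketched as nearest centroid by cosine similarity over MFCC vectors, with a linear VTLN warp on the frequency axis. The centroid values, 3-dim vectors, and phoneme labels below are made up (the real asset holds 66 phoneme centroids in 15.4KB), and the warp direction shown is one modeling choice for mapping adult-trained features toward a child vocal tract.

```python
import math

# Sketch of Track A: linear VTLN warp + nearest-centroid cosine matching.
# Centroids and vectors are toy values, not the shipped 66-phoneme asset.
ALPHA = 1.104  # VTLN factor derived from child F0 mean (269 Hz)

def warp_freq(f_hz: float, alpha: float = ALPHA) -> float:
    """Linear VTLN: rescale the adult-trained frequency axis by alpha."""
    return f_hz / alpha

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

CENTROIDS = {"b": [1.0, 0.2, 0.0], "p": [0.9, 0.4, 0.1], "m": [0.0, 1.0, 0.5]}

def classify_mfcc(vec: list[float]) -> str:
    """Return the phoneme whose centroid is most cosine-similar to vec."""
    return max(CENTROIDS, key=lambda ph: cosine(vec, CENTROIDS[ph]))
```

Nearest-centroid over a few dozen small vectors is why the asset stays in the tens of KB and the per-utterance decision fits well inside a 100ms budget on low-end hardware.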
Result: 97.0% accuracy on the full 66-phoneme test set (only 2 phonetically identical confusions). 100% on Diane's voice baseline. Production-shipping in the L2R Android app.
Stack: Python (librosa, scikit-learn, ONNX, PyTorch for Track B), Kotlin (custom PhonemeClassifier on Android)
05 · End-to-End ML Pipeline
Pose Extraction → Find the Sound
Problem: Animations for a kids' phonics app need expressive face/body params for 10 animals. No animator on staff. Hand-keyframing 10 animals × 9 gestures = months I don't have.
Why it matters: Product surface depends on animation fidelity — kids respond to expressive characters. The pipeline has to take human-recorded reference video as input and produce Rive-ready params as output, automatically, per-animal.
What I built:
- End-to-end pipeline: human video → 2D pose → CharacterScaler per-animal → Rive runtime params
- Video-as-floor, MoCap-as-ceiling: Diane's verified videos set the magnitude floor; BABEL/AMASS MoCap can add but not override
- Per-animal scaling from manifest (not hardcoded) — adding the 10th animal requires zero code changes
- Ships keyframes via Rive MCP (~1200 calls per animal); post-wire structural eval
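The manifest-driven scaling step might look like the sketch below. The field names ("limb_scale", "head_gain") and the animal entries are invented; the real point is that per-animal behavior lives in data, so a new animal is a new manifest entry, not new code.

```python
# Hypothetical CharacterScaler core: manifest fields and animals are
# invented stand-ins for the real per-animal JSON manifests.
MANIFEST = {
    "bear": {"limb_scale": 0.7, "head_gain": 1.3},
    "frog": {"limb_scale": 1.2, "head_gain": 0.9},
}

def scale_params(animal: str, pose: dict[str, float]) -> dict[str, float]:
    """Map human-derived pose params into one animal's Rive-ready ranges."""
    cfg = MANIFEST[animal]
    return {
        "arm_swing": pose["arm_swing"] * cfg["limb_scale"],
        "head_tilt": pose["head_tilt"] * cfg["head_gain"],
    }
```

Keeping the scaling table in the manifest is also what makes the batch command reproducible: the same reference video plus the same manifest always yields the same Rive params.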
Result: Production-shipping in L2R V8 release (Play Store gate passed). 5 base videos × multiple animals × 9 gestures, all auto-derived. Reproducible via one batch command.
Stack: Python (RTMLib 2D landmarks, BABEL/AMASS, savgol smoothing, RDP), Rive runtime (Kotlin Android), JSON manifests, MCP-driven .riv emission