Wave Run v2 — Multi-Agent Orchestration as a DAG

An LLM with all the orchestration info in context will skip the "delegate to a separate session" step under pressure. The fix is not "remind the LLM more clearly." The fix is to put the orchestration decisions where the LLM can't reach them.

The 10-phase process

Each phase is an Airflow task. The scheduler enforces order; the LLM only runs inside individual sessions, not across them.

Status: ✅ codified · 🟡 partial · ⚪ backlog
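The ordering guarantee can be illustrated without Airflow. The sketch below is a plain-Python stand-in for the scheduler, not the DAG itself: `run_phases` and the phase names in the usage note are hypothetical, and the only point is that a phase which raises prevents every later phase from starting.

```python
from typing import Callable

# Hypothetical type: a phase is a name plus a callable that raises on gate failure.
Phase = tuple[str, Callable[[dict], None]]


def run_phases(phases: list[Phase], ctx: dict) -> list[str]:
    """Run phases strictly in order and return the names that completed.

    Any exception propagates immediately, so no later phase can start
    after a failed gate. There is no code path that skips ahead.
    """
    completed: list[str] = []
    for name, fn in phases:
        fn(ctx)  # a failing gate raises here
        completed.append(name)
    return completed
```

With phases such as `("entrypoint_contract", ...)`, `("load_manifest", ...)`, `("dispatch", ...)`, a failure in the first one means dispatch code is never reached, regardless of what any session "decides".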

Phase 000 — Entrypoint Contract ✅ codified

Validate trigger conf has the canonical fields before any other task runs.

Input: dag_run.conf { run_id, manifest_path, target_wave?, manifest_db_path? }
Gate: require run_id != 'wave-unset'; require manifest_path exists; target_wave int when provided
Output: PASS → resolve_execution_context | FAIL → hard block

Eliminates: phase skip / improvisation (Pattern E) — DAG can't run downstream without entrypoint pass
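A minimal sketch of the gate described above, assuming the conf arrives as a plain dict. The function name `validate_entrypoint_conf` and the error strings are illustrative, not taken from the live DAG.

```python
import os


def validate_entrypoint_conf(conf: dict) -> list[str]:
    """Return contract violations; an empty list means PASS."""
    errors: list[str] = []

    run_id = conf.get("run_id")
    if not run_id or run_id == "wave-unset":
        errors.append("run_id is missing or still 'wave-unset'")

    manifest_path = conf.get("manifest_path")
    if not manifest_path or not os.path.exists(manifest_path):
        errors.append(f"manifest_path does not exist: {manifest_path!r}")

    target_wave = conf.get("target_wave")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if target_wave is not None and (
        isinstance(target_wave, bool) or not isinstance(target_wave, int)
    ):
        errors.append("target_wave must be an int when provided")

    return errors
```

A task wrapping this would raise on a non-empty result, which is what turns a FAIL into a hard block for everything downstream.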

Phase 0 — Load Manifest ✅ codified

Resolve manifest path; bind run to executable session inventory.

Input: run_id, manifest_path
Logic: wave_actions._load_manifest(source_path)
  - direct wave_sessions/sessions
  - or CPV3 phase_12.manifest_path redirect
Output: resolved manifest path + session_count + target_wave metadata

Eliminates: wrong-repo worktrees (Pattern F) — repo attestation is per-session in the manifest, validated here
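The two resolution modes can be sketched as follows. This is not `wave_actions._load_manifest` itself: it assumes the manifest is JSON, that sessions live under a top-level `wave_sessions` or `sessions` key, and that a CPV3 document carries the redirect at `phase_12.manifest_path`. All of these layout details are assumptions inferred from the spec lines above.

```python
import json
from pathlib import Path


def load_manifest(source_path: str) -> tuple[Path, list]:
    """Resolve the executable manifest and return (resolved_path, sessions)."""
    path = Path(source_path)
    data = json.loads(path.read_text())

    has_sessions = "wave_sessions" in data or "sessions" in data
    redirect = (data.get("phase_12") or {}).get("manifest_path")
    if not has_sessions and redirect:
        # Redirect mode: follow the pointer once, never recursively.
        path = Path(redirect)
        data = json.loads(path.read_text())

    sessions = data.get("wave_sessions") or data.get("sessions")
    if not sessions:
        # Fail closed: a run with no session inventory must not proceed.
        raise ValueError(f"no executable sessions found in {path}")
    return path, sessions
```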

Phase 0.5 — Bind Wave Forum Thread ✅ codified

Create or bind the per-wave coordination thread; persist thread id for audit.

Logic: 3-mode binding contract
  1) POST /api/db/intake
  2) fallback POST /api/db/threads
  3) fallback /tmp/wave-run-<slug>.log
Output: wave_forum_thread_id persisted in manifest + manifest.db run row

Why it's a phase, not an afterthought: the coordination/audit channel binds at execution start, not when something goes wrong
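The fallback chain can be sketched with the two HTTP modes passed in as callables, so the example makes no claims about request or response shapes (those are not shown in this document). `bind_thread` and the mode labels are hypothetical names.

```python
from typing import Callable, Optional

# A binder takes the wave slug and returns a thread id, or None on failure.
Binder = Callable[[str], Optional[str]]


def bind_thread(slug: str, binders: list[tuple[str, Binder]]) -> tuple[str, str]:
    """Try each binder in order; fall back to a local log path if all fail.

    Returns (mode, thread_id_or_log_path) so the chosen mode can be audited.
    """
    for mode, bind in binders:
        try:
            thread_id = bind(slug)
        except Exception:
            continue  # this mode is unavailable, try the next one
        if thread_id:
            return mode, thread_id
    # Mode 3: the run always gets some audit channel, even with the API down.
    return "logfile", f"/tmp/wave-run-{slug}.log"
```

In use, the list would hold one binder that POSTs to /api/db/intake and one that POSTs to /api/db/threads, in that order.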

Phase 0.7 — Preflight Lint ✅ codified

Mechanical lint of every directive: frontmatter, required sections, MERGE GATE block auto-injected.

Input: directive_dir + expected_sessions (target-wave scoped)
Logic: _preflight_errors() → _autofix_directives() → _dispatch_repair_directive() (one retry)
Output: PASS or fail-closed after retry

Eliminates: VP self-merge (Pattern A) — every directive lands with MERGE GATE block; VPs no longer default-merge at end of session
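A rough sketch of the lint, autofix, single-repair flow on one directive's text. The MERGE GATE wording, the two lint rules, and the `repair` callable are placeholders: the real block text and the behavior of `_dispatch_repair_directive()` are not shown in this document.

```python
from typing import Callable

# Hypothetical block text; only the heading name comes from the source.
MERGE_GATE_HEADING = "## MERGE GATE"
MERGE_GATE_BLOCK = MERGE_GATE_HEADING + "\nDo not merge your own branch. Stop after the handoff.\n"


def preflight_errors(text: str) -> list[str]:
    """Mechanical checks only; no judgment calls."""
    errors = []
    if not text.startswith("---\n"):
        errors.append("missing frontmatter")
    if MERGE_GATE_HEADING not in text:
        errors.append("missing MERGE GATE block")
    return errors


def autofix(text: str) -> str:
    """Inject the MERGE GATE block when absent; leave everything else alone."""
    if MERGE_GATE_HEADING not in text:
        text = text.rstrip("\n") + "\n\n" + MERGE_GATE_BLOCK
    return text


def preflight(text: str, repair: Callable[[str, list[str]], str]) -> str:
    """Autofix, then allow exactly one repair attempt, then fail closed."""
    text = autofix(text)
    errors = preflight_errors(text)
    if errors:
        text = autofix(repair(text, errors))  # the single retry
        errors = preflight_errors(text)
    if errors:
        raise RuntimeError(f"preflight failed after repair retry: {errors}")
    return text
```

The design point is that the gate block is injected by code, so whether a directive carries it never depends on the author remembering to add it.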

Phase 1 — Ownership Check ✅ codified

Hard-fail if two parallel sessions claim the same files.

Logic: validate_ownership_overlap()
  if fail → _dispatch_repair_directive() + one retry
Output: PASS or RuntimeError("Ownership gate failed after repair retry")

Eliminates: merge-conflict-by-design — overlapping file ownership is a precondition error, caught before dispatch
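The overlap check itself reduces to an inverted index from file to claiming sessions. The sketch below assumes ownership is available as a mapping of session id to exact file paths; glob patterns or directory-level claims, if the real manifest uses them, would need expansion first. `find_ownership_overlaps` is a hypothetical name.

```python
from collections import defaultdict


def find_ownership_overlaps(ownership: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each contested file to the sorted list of sessions claiming it.

    An empty result means no two sessions claim the same file.
    """
    claimed_by: dict[str, set[str]] = defaultdict(set)
    for session, files in ownership.items():
        for path in files:
            claimed_by[path].add(session)
    return {
        path: sorted(sessions)
        for path, sessions in claimed_by.items()
        if len(sessions) > 1
    }
```

A non-empty result is what would trigger the repair directive and, after one failed retry, the RuntimeError.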

Phase 2 — Tap #1 (Pre-launch approval) ✅ codified

Typed human approval state machine before dispatch.

Input: marker files (.approved/.rejected/.hold/.expired) OR manifest.db decision events
Logic: _poll_tap_decision() → sensor_tap_prelaunch()
States: approved → proceed; pending → poll; rejected/hold/expired → hard-fail with terminal decision

Why typed states: previously approval timeout silently degraded to "go anyway" — now terminal non-approval blocks dispatch
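The typed states can be modeled as an enum with one function that reads markers and one that decides what the sensor does. The marker naming `<run_id>.<state>` and the rule that blocking markers win over an approval marker are assumptions for this sketch; the manifest.db event path is omitted.

```python
from enum import Enum
from pathlib import Path


class TapDecision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    HOLD = "hold"
    EXPIRED = "expired"
    PENDING = "pending"


# Checked in this order so a stray .approved cannot override a rejection.
_MARKER_ORDER = (
    TapDecision.REJECTED,
    TapDecision.HOLD,
    TapDecision.EXPIRED,
    TapDecision.APPROVED,
)


def poll_tap_decision(marker_dir: str, run_id: str) -> TapDecision:
    """Read marker files; no marker at all means the decision is pending."""
    for decision in _MARKER_ORDER:
        if (Path(marker_dir) / f"{run_id}.{decision.value}").exists():
            return decision
    return TapDecision.PENDING


def tap_allows_dispatch(decision: TapDecision) -> bool:
    """True to proceed, False to keep polling, raise on terminal non-approval."""
    if decision is TapDecision.APPROVED:
        return True
    if decision is TapDecision.PENDING:
        return False
    raise RuntimeError(f"Tap #1 blocked dispatch: {decision.value}")
```

Because every non-approved terminal state raises, a timeout surfaced as `expired` can no longer degrade into "go anyway".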

Phase 3 — Worktree Gate + Dispatch ✅ codified

Verify worktrees exist + branch alignment before launching any session.

Logic:
  worktree_gate(): verify path exists + git worktree list maps path → expected branch
  launch_wave_sessions(): dispatch_directive(...) per session, persist manifest after each success
Output: per-session status=launched OR first-failure halts remainder (no partial-launch desync)

Eliminates: worktree gate skipped (Pattern C) — sessions can't accidentally commit to main because the gate verifies branch isolation BEFORE any agent starts
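The branch-alignment half of the gate can be sketched against the output of `git worktree list --porcelain`, which prints a `worktree <path>` line followed by a `branch refs/heads/<name>` line for each attached worktree. The function names and the shape of `expected` (path to branch) are assumptions; the real `worktree_gate()` may check more than this.

```python
def parse_worktrees(porcelain: str) -> dict[str, str]:
    """Map worktree path to branch name from --porcelain output.

    Detached or bare worktrees have no 'branch' line and are left out.
    """
    mapping: dict[str, str] = {}
    current = None
    for line in porcelain.splitlines():
        if line.startswith("worktree "):
            current = line[len("worktree "):]
        elif line.startswith("branch ") and current is not None:
            ref = line[len("branch "):]
            mapping[current] = ref.removeprefix("refs/heads/")
    return mapping


def worktree_gate_errors(expected: dict[str, str], porcelain: str) -> list[str]:
    """Compare manifest expectations against git; empty list means PASS."""
    actual = parse_worktrees(porcelain)
    errors = []
    for path, branch in expected.items():
        if path not in actual:
            errors.append(f"{path}: no branch-attached worktree registered")
        elif actual[path] != branch:
            errors.append(f"{path}: on {actual[path]!r}, expected {branch!r}")
    return errors
```

The porcelain text would come from something like `subprocess.run(["git", "worktree", "list", "--porcelain"], capture_output=True, text=True, check=True).stdout`, run by the DAG task before any session is dispatched.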

Phase 4 — Execute & Monitor ✅ codified

Wait for handoffs with freshness validation; never accept stale state as "ready."

Logic:
  sensor_wait_wave_sessions()
    - require non-empty selected sessions
    - handoff_exists_strict(...)
    - freshness check vs session_start
  validate_session_outputs(): same freshness check before commit readiness
Output: fresh + complete → advance | stale/empty → block (never false-ready)

Eliminates: teardown skip (Pattern D) — Phase 5 sensor blocks until /end-session per session lands a fresh handoff
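One plausible reading of the freshness rule, assuming handoffs are files and freshness is judged by modification time against the session's start timestamp. The real `handoff_exists_strict(...)` signature is elided in the source, so `handoff_is_fresh` and `wave_ready` here are hypothetical helpers.

```python
import os


def handoff_is_fresh(handoff_path: str, session_start: float) -> bool:
    """True only if the handoff exists, is non-empty, and was written
    at or after session_start (seconds since the epoch)."""
    try:
        st = os.stat(handoff_path)
    except FileNotFoundError:
        return False
    return st.st_size > 0 and st.st_mtime >= session_start


def wave_ready(handoffs: dict[str, str], session_starts: dict[str, float]) -> bool:
    """Every selected session needs a fresh handoff; an empty selection
    is treated as not ready rather than vacuously ready."""
    if not handoffs:
        return False
    return all(
        handoff_is_fresh(path, session_starts[session])
        for session, path in handoffs.items()
    )
```

The empty-selection check matters: `all()` over nothing is True in Python, which is exactly the false-ready case the sensor is meant to rule out.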

Phase 5–6 — Orchestrator Merge 🟡 partial

Spawn a separate orchestrator session to execute merges. Coordinator never runs git itself.

Logic: merge orchestrator dispatched as autonomous session
  - generates orchestrator directive from template
  - runs merge_branches.sh with conflict detection + rollback

Why this is the most important phase: the COORDINATOR (this DAG) has no git access. The ORCHESTRATOR (a separate cmux session) does. Separation is mechanical, not by convention.

Eliminates: coordinator merges directly (Pattern B) — DAG tasks have no shell access to git

The 7 failure patterns this DAG eliminated

Source: forensic analysis of 38 wave-run handoffs across 9 sprints (Apr 11–20, 2026).

Pattern	Frequency	Cost / incident	How v2 prevents it
A: VP Self-Merge	3 in 1 sprint	~1 day cleanup	Phase 0.7 lint auto-injects MERGE GATE block into every directive before dispatch
B: Coordinator Merges Directly	4 incidents, 3 sprints	30–90 min	Coordinator is a DAG: no shell, no git. Orchestrator is a separately-dispatched session
C: Worktree Gate Skipped	1 critical (Apr 13)	60+ min	Phase 3 task `worktree_gate` blocks dispatch unless `git worktree list` matches manifest
D: Teardown Skipped	1 incident	~1 day cleanup	Phase 5 sensor blocks until each session writes a fresh handoff (freshness validated against session_start)
E: Phase Skip / Improvisation	2 incidents	varies	Airflow scheduler enforces phase order. There's nothing to improvise: the next task is whatever the DAG runs next
F: Wrong-Repo Worktrees	1 incident (CYOA)	~30 min	Per-session repo attestation in directive frontmatter, validated at Phase 0 (Load Manifest)
G: Dispatch Bypass	1 incident	varies	Dispatch is a Python module called by the DAG, not a prose instruction the operator can ignore

Why this matters as a Generative AI engineering pattern

Most "agents fail at X" stories end with "we wrote a stricter prompt." That's a moving target. The real fix is to identify what the LLM shouldn't be deciding, then put those decisions in deterministic code where the LLM can't reach them.

In this system, the LLM is excellent at doing the work inside a session — writing code, running tests, producing handoffs. It is unreliable at coordinating across sessions, where context is thin and incentives compound (every agent thinks it should "ship now"). So the architecture moved orchestration into Airflow, kept LLMs inside the per-session boxes, and gave the human a single approval surface.

The result is the GenAI version of "separate the things that change from the things that don't." Coordination doesn't change between sprints — it's a fixed 10-phase contract. The work inside each session changes every time. So coordination is code; work is LLM.

Result:
7 documented failure patterns from the prose-skill version eliminated mechanically.
~4–6 Diane-hours of merge cleanup per sprint reclaimed. Across 9 documented sprints (Apr 11–20), that's 1–2 working days returned.

Forensic source: vault/3 DIANE/vp-opus/pack-up/2026-04-21-analyst-waverun-failures.md
Phase contract: vault/3 DIANE/vp-opus/pack-up/2026-04-25-waverun-spec-comparison-phase0-4.md
Live DAG: projects/systems_infra-runtime-master/orchestration/dags/wave_run_dag_v2.py