PROBLEM: Animations for a kids' phonics app need expressive face/body params for 10 animals. No animator on staff. Hand-keyframing 10 animals × 9 gestures each = months of work I don't have.
WHY IT MATTERS: The product surface depends on animation fidelity — kids respond to expressive characters. The pipeline has to take human-recorded reference video as input and produce Rive-ready params as output, automatically, per-animal.
STACK: Python (RTMLib 2D landmarks, BABEL/AMASS MoCap data, savgol smoothing, RDP curve reduction), Rive runtime (Kotlin Android), JSON manifests for character scaling, MCP-driven .riv emission

Pose Extraction → Find the Sound

End-to-end ML loop: human video → 2D pose → character scaling → Rive runtime → on-device feedback in the L2R app
The interesting ML problem isn't "extract poses from video." That's a solved library call. The interesting problem is making the extracted poses actually drive a believable animation on a stylized animal that has nothing to do with a human skeleton.

The end-to-end loop

① Diane's reference video

5 verified gesture videos (ask_child, celebrating, encouraging, look_at_letter, thinking): recordings of me performing each gesture in front of a webcam. Plus BABEL/AMASS public MoCap data for seed gestures (idle, walk_in, waiting, helping).

② Pose extraction → Rive params

Per-frame 2D landmarks → CharacterScaler per-animal → schema (magnitudes per body part) + curves (temporal shapes) + auto specs (engine-ready tracks). Outputs a wired .riv file.

③ Find the Sound (in-app)

Kid taps a sound in the L2R Android app. The wired .riv plays the gesture (animal celebrates, encourages, asks). Ships in the production V8 release.

Watch the loop close — Find the Sound demo

The pipeline mechanics

Five steps, numbered 0–4: step 0 is a safety guard, step 1 is the ML core, steps 2 and 3 are the wiring + verification, and step 4 is the visual QA previewer.

STEP 0 — Safety: Copy the .riv before touching anything

The .riv files are hand-created art assets. The pipeline never touches the original; it works on a copy: shutil.copy2(original_riv, output_dir/starter_copy.riv). This is the kind of safety guarantee you only learn to want after losing two days of artwork to a buggy script.
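A minimal sketch of that guard, assuming a hypothetical function name (only the shutil.copy2 call and the starter_copy.riv path are from the pipeline):

    from pathlib import Path
    import shutil

    def stage_starter_copy(original_riv: Path, output_dir: Path) -> Path:
        """Work on a copy; the hand-made .riv is never opened for writing."""
        output_dir.mkdir(parents=True, exist_ok=True)
        working_copy = output_dir / "starter_copy.riv"
        shutil.copy2(original_riv, working_copy)  # copy2 keeps timestamps/metadata
        return working_copy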

STEP 1 — build_external_gesture_bundle()

The ML core. For each gesture, produce schema + curves + auto-specs.
A. CharacterScaler from manifest
   arm_translation_scale = arm_length / human_arm_reach
   arm_translation_cap   = arm_length            (hand can't travel past full reach)
   lateral_sway_cap      = body_width × 0.5
   Chick: arm_length=156px → scale=0.277, cap=156px, sway=212px
   Bigger animal auto-adjusts. No hardcoded values.
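A sketch of what that scaler could look like, assuming manifest keys arm_length and body_width and a human_arm_reach constant back-derived from the chick numbers (156 / 0.277 ≈ 563px); the real CharacterScaler internals aren't shown in this writeup:

    import json
    from dataclasses import dataclass
    from pathlib import Path

    HUMAN_ARM_REACH_PX = 563.0  # assumed; back-derived from 156px / 0.277

    @dataclass
    class CharacterScaler:
        arm_translation_scale: float
        arm_translation_cap: float  # px
        lateral_sway_cap: float     # px

        @classmethod
        def from_manifest(cls, manifest_path: Path) -> "CharacterScaler":
            m = json.loads(manifest_path.read_text())
            return cls(
                arm_translation_scale=m["arm_length"] / HUMAN_ARM_REACH_PX,
                arm_translation_cap=m["arm_length"],   # full reach is the cap
                lateral_sway_cap=m["body_width"] * 0.5,
            )

A chick manifest with arm_length=156 and body_width=424 reproduces the numbers above: scale≈0.277, cap=156px, sway=212px.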

B. Load configs (animation_defaults.json, gesture_disambiguation.json)

C. For each gesture:
   Read Babel segment JSON  (60 frames × 3 segments)
   _derive_schema_entry()   — percentile(95) of arm angle, root.y, body.rot, head.rot
   _apply_video_priority()  — Diane's video magnitudes are the FLOOR; Babel can ADD but not OVERRIDE
   _apply_gesture_disambiguation() — enforces look_at_letter≠ask_child even if MoCap labels overlap
   _derive_curve_entry()    — savgol smooth → cycle detect → RDP reduce
                              → motion_shape, peak, p95, cycle_period
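A compressed sketch of the curve path in C, under stated assumptions: the savgol parameters are invented, and cycle detection via the dominant FFT bin is my stand-in because the real detector isn't described here. Only the smooth → cycle detect → RDP order and the output fields come from the pipeline:

    import numpy as np
    from scipy.signal import savgol_filter

    def rdp(points: np.ndarray, epsilon: float) -> np.ndarray:
        """Ramer-Douglas-Peucker reduction of an (N, 2) frame/value polyline."""
        start, end = points[0], points[-1]
        chord = end - start
        norm = np.hypot(chord[0], chord[1]) or 1e-9
        # Perpendicular distance of every point to the start->end chord.
        d = points - start
        dists = np.abs(chord[0] * d[:, 1] - chord[1] * d[:, 0]) / norm
        idx = int(np.argmax(dists))
        if dists[idx] > epsilon:
            left = rdp(points[: idx + 1], epsilon)
            right = rdp(points[idx:], epsilon)
            return np.vstack([left[:-1], right])
        return np.vstack([start, end])

    def derive_curve_entry(raw: np.ndarray, fps: int = 20, epsilon: float = 0.5) -> dict:
        """Smooth -> cycle detect -> reduce, mirroring _derive_curve_entry()."""
        smooth = savgol_filter(raw, window_length=9, polyorder=3)
        # Cycle detection via the dominant FFT bin -- an assumption; the
        # real detector isn't shown in the pipeline description.
        spectrum = np.abs(np.fft.rfft(smooth - smooth.mean()))
        k = int(np.argmax(spectrum[1:])) + 1   # non-DC bin ~= cycles in window
        cycle_period = len(smooth) / fps / k   # seconds per cycle
        keyframes = rdp(np.column_stack([np.arange(len(smooth)), smooth]), epsilon)
        return {
            "motion_shape": keyframes.tolist(),
            "peak": float(np.max(np.abs(smooth))),
            "p95": float(np.percentile(np.abs(smooth), 95)),
            "cycle_period": cycle_period,
        }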

D. auto_gesture_specs.generate_specs_for_gesture()
   Converts curves + schema → engine-ready track specs
   { role: "arm_left", prop: "rotation", pattern, mag, dur }
   FPS scaling: Babel source is 20fps, playback is 60fps → frame counts scaled ×3 to keep real-time cycle length (sketched below)
   Prevents: celebrating collapsing to a 0.4s jitter instead of its ~1.3s graceful cycle
Outputs: gesture_schema_v3.json, motion_curves.json, auto_gesture_specs.json, bundle_metadata.json, motion_curves_dashboard.html (interactive frame browser)
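The FPS scaling in D, as a sketch; the spec fields mirror the example track above, and the function name is mine:

    SOURCE_FPS = 20    # Babel MoCap capture rate
    PLAYBACK_FPS = 60  # Rive runtime playback rate

    def scale_track_spec(spec: dict) -> dict:
        """Stretch frame indices so a cycle keeps its real-time duration.

        Played naively at 60fps, a 20fps source cycle runs 3x too fast
        (the 0.4s celebrating jitter); scaling frame counts x3 restores
        the graceful ~1.3s cycle.
        """
        factor = PLAYBACK_FPS // SOURCE_FPS
        scaled = dict(spec)
        scaled["dur"] = spec["dur"] * factor
        scaled["pattern"] = [(f * factor, v, easing) for f, v, easing in spec["pattern"]]
        return scaled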

STEP 2 — wire_from_manifest.py

Loads the COPIED .riv (art only) and emits ~1200 keyframe calls via the Rive MCP: creates the ViewModel + State Machine, wires idle, the blink overlay, mouth visemes, and ALL gesture poses, then exports the wired .riv.
prompt_1: ViewModel + State Machine + boolean inputs
prompt_2: Verify SVG groups match manifest group_map
prompt_3: Emit idle animation
prompt_4: Blink overlay (independent layer)
prompt_5: Mouth visemes (6 shapes, independent layer)
prompt_6: ALL gesture poses
prompt_7: Audit probe (list_objects verification)
prompt_8: State machine (schema-driven transitions, 3 layers)
prompt_9: Post-wire verification
prompt_10: Export .riv
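A plausible driver for that sequence, fail-fast and sequential; session stands in for the Rive MCP connection, and the prompt identifiers are paraphrases of the ten prompts above, not real API names:

    WIRING_PROMPTS = [
        "create_viewmodel_and_state_machine",  # prompt_1
        "verify_group_map",                    # prompt_2
        "emit_idle",                           # prompt_3
        "emit_blink_overlay",                  # prompt_4
        "emit_mouth_visemes",                  # prompt_5
        "emit_all_gesture_poses",              # prompt_6
        "audit_probe",                         # prompt_7
        "build_state_machine",                 # prompt_8
        "postwire_verification",               # prompt_9
        "export_riv",                          # prompt_10
    ]

    def run_wiring(session, manifest: dict) -> None:
        """Run the ten prompts in order; stop at the first failure so a
        half-wired .riv never reaches the eval step."""
        for step, prompt in enumerate(WIRING_PROMPTS, start=1):
            result = session.run(prompt, manifest=manifest)
            if not result.ok:
                raise RuntimeError(f"prompt_{step} ({prompt}) failed: {result.error}")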

gesture_engine.emit_gesture() — the keyframe emitter
  For each track in auto_gesture_specs:
    _mag()    — magnitude from bundle schema
    Pattern   — (frame, value, easing) list
    _clamp()  — POSE_LIMITS safety envelope
  Pivot resolution: parts_metadata.pivot → layers.pivot → artboard center
  ~1200 sequential MCP calls per animal (chick reference)
Output: <animal>_babel_tuned_wired.riv
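Sketched in Python, with send_keyframe standing in for the MCP transport (its real call shape isn't shown here). The _clamp step, the _mag() lookup, and the pivot fallback chain come from the description above; the POSE_LIMITS values are invented:

    POSE_LIMITS = {"rotation": (-45.0, 45.0)}  # illustrative safety envelope

    def _clamp(prop: str, value: float) -> float:
        lo, hi = POSE_LIMITS.get(prop, (float("-inf"), float("inf")))
        return max(lo, min(hi, value))

    def resolve_pivot(part: dict, layer: dict, artboard: dict) -> tuple[float, float]:
        """Fallback chain: parts_metadata.pivot -> layers.pivot -> artboard center."""
        for source in (part, layer):
            if "pivot" in source:
                return tuple(source["pivot"])
        return (artboard["width"] / 2.0, artboard["height"] / 2.0)

    def emit_gesture(tracks: list[dict], schema: dict, send_keyframe) -> int:
        """One keyframe call per (track, pattern point)."""
        calls = 0
        for track in tracks:
            mag = schema[track["role"]][track["prop"]]  # the _mag() lookup
            for frame, value, easing in track["pattern"]:
                send_keyframe(
                    role=track["role"],
                    prop=track["prop"],
                    frame=frame,
                    value=_clamp(track["prop"], value * mag),
                    easing=easing,
                )
                calls += 1
        return calls  # ~1200 for the chick reference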

STEP 3 — gesture_postwire_eval.py

Verification step — re-loads the exported .riv via MCP and validates structure: ViewModel properties present + typed, State Machine inputs match gesture count, animation names match schema, layer count correct. Outputs JSON + TSV evals.
Honest limitation noted in code: this checks structural correctness, not visual quality. Visual QA is human-in-the-loop via the previewer.
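A sketch of what those structural checks reduce to, assuming wired is the object dump returned by the list_objects probe (the field names are guesses):

    def eval_structure(wired: dict, schema: dict) -> list[str]:
        """Names, types, counts. Deliberately nothing about visual quality."""
        failures = []
        gestures = set(schema["gestures"])
        anims = {a["name"] for a in wired["animations"]}
        if missing := gestures - anims:
            failures.append(f"missing animations: {sorted(missing)}")
        inputs = {i["name"] for i in wired["state_machine"]["inputs"]}
        if missing := gestures - inputs:
            failures.append(f"missing SM inputs: {sorted(missing)}")
        if len(wired["state_machine"]["layers"]) != 3:  # idle / blink / viseme
            failures.append("expected 3 state machine layers")
        return failures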

STEP 4 — generate_previewer.py

Generates interactive HTML with Rive WASM runtime — embeds the wired .riv as base64, exposes gesture buttons (boolean SM inputs) and viseme slider (number input). This is the visual QA loop. The frame-by-frame view of one such generated preview is linked below.
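The embed trick, sketched: the public Rive web runtime accepts a raw ArrayBuffer via its buffer option, so the .riv can ship inside the HTML as base64. The state-machine name, canvas size, and CDN build here are placeholders, and the gesture buttons and viseme slider are omitted for brevity:

    import base64
    from pathlib import Path

    PREVIEW_TEMPLATE = """<!doctype html>
    <canvas id="rv" width="500" height="500"></canvas>
    <script src="https://unpkg.com/@rive-app/canvas"></script>
    <script>
      // Decode the embedded .riv and hand Rive the raw ArrayBuffer.
      const bytes = Uint8Array.from(atob("{riv_b64}"), c => c.charCodeAt(0));
      new rive.Rive({{
        buffer: bytes.buffer,
        canvas: document.getElementById("rv"),
        stateMachines: "{sm_name}",  // animal-specific; placeholder here
        autoplay: true,
      }});
    </script>"""

    def generate_previewer(wired_riv: Path, out_html: Path, sm_name: str) -> None:
        riv_b64 = base64.b64encode(wired_riv.read_bytes()).decode("ascii")
        out_html.write_text(PREVIEW_TEMPLATE.format(riv_b64=riv_b64, sm_name=sm_name))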

Drill-downs

📷 Frame Browser — encouraging gesture (interactive). Per-frame view of pose extraction output for the "encouraging" gesture.

Why this matters as a Generative AI / ML engineering pattern

This is not a single-model story. It's a pipeline story — and pipelines are where most "ML in production" failures actually happen. Each step is small, deterministic, and replaceable. The character scaling step doesn't know about the curve smoothing step. The keyframe emitter doesn't know about the eval. Each stage can be swapped (e.g., RTMLib → MediaPipe; Babel MoCap → custom data) without touching the others.
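That swap-ability falls out of keeping each stage behind a tiny interface. A sketch, with hypothetical class names:

    from typing import Protocol
    import numpy as np

    class PoseExtractor(Protocol):
        """Stage contract: video in, per-frame 2D landmarks out."""
        def extract(self, video_path: str) -> np.ndarray:  # (frames, joints, 2)
            ...

    class RTMLibExtractor:
        """Current backend (wraps RTMLib); body elided."""
        def extract(self, video_path: str) -> np.ndarray: ...

    class MediaPipeExtractor:
        """Drop-in replacement; the scaling and curve stages never change."""
        def extract(self, video_path: str) -> np.ndarray: ...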

Two design choices worth calling out:

1. Human video is the floor. _apply_video_priority() treats the recorded video magnitudes as a floor that MoCap can add to but never override, so every gesture keeps its human-performed character.
2. Evals check structure; humans check feel. The post-wire eval validates names, types, and counts only; visual quality deliberately stays human-in-the-loop via the generated previewer.

Outcome:
Shipping in production in the L2R V8 release (Play Store gate passed).
5 base videos + 4 MoCap seed gestures → 9 gestures per animal, all auto-derived across multiple animals.
The pipeline is reproducible: python3 scripts/run_external_preview_batch.py rebuilds all animals.