Learn to Read · Rive Animation · Data Extraction

From hardcoded offsets to data-driven, pose-based gestures

How my own movement became the source of truth for 10 animal characters. Same pattern as the phoneme classifier's feature pipeline — record the real signal, extract a numeric schema, let the schema drive everything downstream.

What I built:

Pose-extraction pipeline turning my own gesture videos into per-frame body data (133 keypoints + facial blendshapes + iris)
Universal-rest-pose model so joint deltas are credible — solved the "no clear rest frame in a performing person" problem
95th-percentile aggregation: ~100 frames of pose data → ~30-parameter gesture schema
v3 schema marks every recorded gesture as source: measured instead of source: seed, so the engine no longer has hardcoded keyframe values

Result: A single config (gesture_schema_v3.json) that drives all downstream wiring. Adding a new gesture is a recording, not a tuning session.

Stack: Python (RTMLib 133-keypoint pose extraction with MediaPipe fallback, NumPy, savgol smoothing), JSON manifests for the schema contract

Take a look at this brief walkthrough — eight beats, click through at your own pace, video to schema in one screen.

Self-paced 8-beat walkthrough — head-rotation extraction from raw video through to the gesture schema entry that drives the wired Rive animal.

And the source data the walkthrough is built around — me doing the encouraging gesture, raw, on a loop:

Source recording — encouraging.mp4. Phone camera, plain background, multiple takes. This single file produces ~30 measured parameters in gesture_schema_v3.json.

Detailed implementation

§1 Why I pivoted

The first version of the engine had hand-picked numbers everywhere. arm_left.rot = 10. tail.rot = 8. They felt right when I was tuning one animal one time. They didn't survive the second animal, the third gesture, or the fourth iteration of "the celebration looks too small."

What I wanted was a credible source for these magnitudes. Not numbers I made up — numbers that came from a body actually doing the gesture. So I started recording myself.

Before — v1 hand-tuned

"idle": {
  "_meta": {"source": "seed"},
  "root.y": 15,
  "arm_left.rot": 10,
  "arm_right.rot": 10,
  "body.rot": 0,
  "head.rot": 0,
  "smile.intensity": 0.2
}

After — v3 measured from video

"celebrating": {
  "_meta": {
    "source": "measured",
    "n_frames": 100,
    "gesture_type": "cyclic",
    "cycle_period_frames": 13
  },
  "arm_left.rot": 82.6,
  "arm_right.rot": 84.56,
  "body.rot": 9.34,
  "root.y": 164.74,
  "smile.intensity": 0.8483
}

The _meta.source field is the contract: every entry is either seed (legacy hand-tuned, kept only for poses I haven't recorded yet) or measured (extracted from video). New gestures must be measured.
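A minimal sketch of how that contract could be checked, assuming each gesture entry carries a `_meta.source` field as in the two excerpts above. The function name `check_sources` and the two-entry example dict are hypothetical, not part of the pipeline:

```python
ALLOWED_SOURCES = {"seed", "measured"}

def check_sources(gestures: dict) -> dict:
    """Group gesture names by _meta.source; raise on any other value."""
    by_source = {source: [] for source in sorted(ALLOWED_SOURCES)}
    for name, entry in gestures.items():
        source = entry.get("_meta", {}).get("source")
        if source not in ALLOWED_SOURCES:
            raise ValueError(f"{name}: unexpected _meta.source {source!r}")
        by_source[source].append(name)
    return by_source

# Hypothetical two-entry schema, trimmed from the excerpts above.
example = {
    "idle": {"_meta": {"source": "seed"}, "root.y": 15},
    "celebrating": {"_meta": {"source": "measured"}, "root.y": 164.74},
}
report = check_sources(example)
print(report)  # {'measured': ['celebrating'], 'seed': ['idle']}
```

Running a check like this over the shipped schema would list exactly which poses are still waiting for a recording.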

From the post-mortem on the Apr 8 sprint, the design rule I committed to:
"defaults in one place · pattern templates in one place · no hardcoded keyframe values in the engine." The engine's job is to interpolate; the schema's job is to say what the body does.

§2 The source — five gesture recordings of me

Each recording is one gesture, multiple takes, phone camera, plain background. The pipeline reads data/gestures_config.json, which maps gesture name to video file:

{
  "gestures": {
    "ask_child":       { "video": "islookingatchild.mp4" },
    "celebrating":     { "video": "iscelebrating.mp4" },
    "encouraging":     { "video": "isencouragingmp4.mp4" },
    "look_at_letter":  { "video": "islookingattarget.mp4" },
    "thinking":        { "video": "isthinking.mp4" }
  }
}

Adding a sixth gesture is a single line in this file plus a new .mp4. No code change.
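To illustrate why no code change is needed, here is a hedged sketch of a loader that treats the config as plain data. The function name `gesture_videos` and the `data/videos` directory are assumptions made for the example; only the JSON shape comes from the config above:

```python
import json
from pathlib import Path

def gesture_videos(config_text: str, video_dir: Path) -> dict:
    """Map each gesture name to the path of its source recording."""
    config = json.loads(config_text)
    return {
        name: video_dir / spec["video"]
        for name, spec in config["gestures"].items()
    }

# Two entries copied from gestures_config.json; the directory is hypothetical.
config_text = """
{
  "gestures": {
    "celebrating": { "video": "iscelebrating.mp4" },
    "thinking":    { "video": "isthinking.mp4" }
  }
}
"""
videos = gesture_videos(config_text, Path("data/videos"))
for name, path in videos.items():
    print(f"{name}: {path}")
```

Because the loop is driven by whatever keys the file contains, a new entry is picked up without touching the loader.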

§3 The frame dashboard — what the extractor sees

The extractor produces one set of landmarks per frame. The dashboard below shows the skeleton overlaid on the source video, with per-joint traces (arm angle, hip Y, body tilt) plotted next to it. Scrub through frames to see how each parameter moves with the gesture — this is what I look at when I'm trying to find inflection points and verify a recording is usable.

Live dashboard — every frame of iscelebrating.mp4 with extracted skeleton, arm L/R angles, hip Y, and body tilt. The graphs on the right are how I read motion.

§4 The rest-frame problem

Joint magnitudes only mean something as deltas from rest. arm_left.rot = 82.6° is a celebration-sized arm swing only if I know what "arm down at the side" looks like. So I needed a credible rest pose for this body.

Why this was non-trivial: a gesture video doesn't contain a rest frame. The person is performing the entire time. Picking "frame 0" gives you the first frame of the gesture, not a neutral pose. Different methods give wildly different answers.

The solution — calmest frames across all videos

I score every frame from every gesture video on three criteria (motion, arm neutrality, body tilt), pick the 20 most neutral frames overall, and average them. The rest pose is a property of the person, not of any one gesture, so pooling across gestures is what makes it stable:

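# scripts/rest_pose_model.py:115 — compute_cross_gesture_rest()
# Excerpt: `frames`, `motion`, and `series` are built earlier in the function.
import math
from statistics import mean

scores = []
for i in range(len(frames)):
    # Motion score (lower = more still)
    m = motion[i]

    # Arm neutrality — arms should hang down, ~0–15° from vertical
    angles = []
    for side in ['LEFT', 'RIGHT']:
        sx, sy = series[f'{side}_SHOULDER']['x'][i], series[f'{side}_SHOULDER']['y'][i]
        wx, wy = series[f'{side}_WRIST']['x'][i], series[f'{side}_WRIST']['y'][i]
        angles.append(math.degrees(math.atan2(abs(wx - sx), max(wy - sy, 0.001))))
    arm_score = mean(angles)

    # Body tilt — shoulders should be level
    # (shoulder deltas are an assumed definition; the excerpt omits them)
    dx_shoulders = abs(series['RIGHT_SHOULDER']['x'][i] - series['LEFT_SHOULDER']['x'][i])
    dy_shoulders = abs(series['RIGHT_SHOULDER']['y'][i] - series['LEFT_SHOULDER']['y'][i])
    tilt = math.degrees(math.atan2(dy_shoulders, max(dx_shoulders, 0.001)))

    # Combined — 40% motion stillness, 40% arm neutrality, 20% body uprightness
    combined = 0.4 * (m / 0.5) + 0.4 * (arm_score / 45.0) + 0.2 * (tilt / 10.0)
    scores.append((combined, i))

# Sort ascending by combined score (lower = more neutral), keep the 20 lowest, average → rest_pose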

The output is data/rest_pose.json with normalized [0–1] (x, y) for every landmark. Stability check at the end: if the rest frames came from at least 2 different gesture videos, the rest pose is "robust." All current rest poses pass that check.
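The selection, averaging, and stability check can be sketched as follows. This is a standard-library illustration under stated assumptions, not the code in rest_pose_model.py: the per-frame dict layout (`score`, `video`, `landmarks`) and the function name `average_rest_pose` are hypothetical, and the toy data uses 2 rest frames instead of 20 to stay readable:

```python
from statistics import fmean

def average_rest_pose(frames, n_rest=20):
    """Average the n_rest lowest-scoring frames into one rest pose.

    frames: list of dicts with 'score' (lower = more neutral), 'video',
    and 'landmarks' mapping a landmark name to a normalized (x, y) pair.
    Returns (rest_pose, robust), where robust means the chosen frames
    came from at least 2 different gesture videos.
    """
    calmest = sorted(frames, key=lambda f: f["score"])[:n_rest]
    rest_pose = {
        name: (
            fmean(f["landmarks"][name][0] for f in calmest),
            fmean(f["landmarks"][name][1] for f in calmest),
        )
        for name in calmest[0]["landmarks"]
    }
    robust = len({f["video"] for f in calmest}) >= 2
    return rest_pose, robust

# Toy input: three scored frames from three videos, one landmark each.
frames = [
    {"score": 0.9, "video": "iscelebrating.mp4",
     "landmarks": {"LEFT_WRIST": (0.30, 0.20)}},
    {"score": 0.1, "video": "isthinking.mp4",
     "landmarks": {"LEFT_WRIST": (0.40, 0.70)}},
    {"score": 0.2, "video": "islookingatchild.mp4",
     "landmarks": {"LEFT_WRIST": (0.42, 0.72)}},
]
rest_pose, robust = average_rest_pose(frames, n_rest=2)
print(rest_pose, robust)
```

The high-scoring celebration frame (arm raised) is dropped, and the two calm frames from different videos are averaged, which is the cross-gesture pooling described above.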

What this unlocks

With a credible rest, every gesture's parameters become honest deltas. arm_left.rot = 82.6° in celebrating means the arm rotated 82.6 degrees away from where it sits at rest — a measurement, not a guess.

§5 Pixel space vs Rive space

Video frames live in pixels: a 1080×1920 phone capture, with landmarks reported as normalized [0–1] coordinates within that frame. Rive lives in artboard units: every animal's artboard is 1024×1024 unless its manifest overrides it. These are two different coordinate systems, so the pipeline converts from one to the other before any magnitude leaves the extractor:

# scripts/generate_gesture_schema.py
import numpy as np

CANVAS_HEIGHT_PX = 1024.0
CANVAS_WIDTH_PX  = 1024.0

# root.y is a vertical bounce: normalized delta from rest, scaled into Rive canvas space.
# Image y grows downward, so (rest - current) is positive when the hips rise.
hip_up = rest_hip_y - all_metrics['hip_y']      # normalized [0,1]
peak_up = np.percentile(hip_up, 95)
root_y = max(0, peak_up) * CANVAS_HEIGHT_PX     # scaled to artboard height
params['root.y'] = round(root_y, 2)

Rotation parameters are in degrees, which do not depend on resolution, so they pass through unchanged. Translation parameters (root.y, root.x_sway) get scaled into canvas units. Each manifest declares its artboard dimensions, so the same schema feeds animals of different sizes via per-animal artboard overrides.
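As a small worked example of the scaling and the per-animal override, here is a sketch with made-up values. The override table, the animal name `tall_animal`, and the function name `root_y_for` are hypothetical; only the 1024 default and the "clamp at zero, multiply by artboard height" rule come from the excerpt above:

```python
DEFAULT_ARTBOARD_HEIGHT = 1024.0

# Hypothetical per-animal overrides; the animal name and size are made up.
ARTBOARD_HEIGHT_OVERRIDES = {"tall_animal": 2048.0}

def root_y_for(animal: str, normalized_peak_up: float) -> float:
    """Scale a normalized upward hip delta into that animal's artboard units."""
    height = ARTBOARD_HEIGHT_OVERRIDES.get(animal, DEFAULT_ARTBOARD_HEIGHT)
    # Downward deltas are clamped to 0: root.y only models the upward bounce.
    return round(max(0.0, normalized_peak_up) * height, 2)

print(root_y_for("chick", 0.25))        # 256.0
print(root_y_for("tall_animal", 0.25))  # 512.0
print(root_y_for("chick", -0.1))        # 0.0
```

The same normalized measurement yields a proportionally larger bounce on a larger artboard, which is the point of keeping the schema in normalized terms until the last step.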

§6 Pose extraction — 133 keypoints per frame

Primary extractor is RTMLib (full body, hands, face — 133 keypoints). MediaPipe's lighter 13-landmark model is the fallback. Per gesture, ~100 frames are extracted at the source video's frame rate.

# scripts/generate_gesture_schema_v3.py:102
import json
import subprocess
import sys

def extract_pose_frames(video_path, gesture, max_frames=100):
    """Extract pose using rtmlib (133 keypoints) with fallback to MediaPipe."""
    rtmlib_script = SCRIPT_DIR / 'pose_extractor_v2.py'
    if rtmlib_script.exists():
        # 133 keypoints: 17 body + 6 feet + 68 face + 42 hands
        result = subprocess.run([sys.executable, str(rtmlib_script),
                                 '--input', str(video_path),
                                 '--gesture', gesture,
                                 '--max_frames', str(max_frames)],
                                capture_output=True, text=True)
        if result.returncode == 0:
            # Assumption: the extractor prints its JSON result to stdout.
            # If it writes a file instead, open and load that file here.
            return json.loads(result.stdout)

    # Fallback to MediaPipe (13 body landmarks)
    print("    Falling back to MediaPipe (13 landmarks)")
    ...

For face data, MediaPipe Face Mesh runs in parallel on the same frames and produces ~50 blendshape activations (smile / squint / brow / jaw / cheek puff) plus iris-relative-to-eye coordinates for pupil tracking.

§7 From 100 frames to a 30-parameter gesture vector

The phoneme classifier extracts ~300 features from a single audio sample. The gesture pipeline does the analogous thing in the visual domain: ~100 video frames → a single fixed-size parameter vector. The aggregation is 95th-percentile delta from rest pose, per body part:

# scripts/generate_gesture_schema_v3.py:440 — per-gesture aggregation

# Arm rotation
for side, series, rest_val in [
    ('left', all_metrics['arm_l'], rest_arm_l),
    ('right', all_metrics['arm_r'], rest_arm_r),
]:
    peak  = np.percentile(series, 95)        # 100 frames → single peak value
    delta = abs(peak - rest_val)            # delta from rest pose
    params[f'arm_{side}.rot'] = round(delta, 2)

# Body rotation, head rotation, head tilt — same pattern
delta_body = abs(np.percentile(all_metrics['body_tilt'], 95) - rest_body_tilt)
params['body.rot'] = round(delta_body, 2)

# Face data — averaged across frames (smile / squint / brow / jaw / pupil)
smile_l = np.mean(blendshapes.get('mouthSmileLeft', [0]))
smile_r = np.mean(blendshapes.get('mouthSmileRight', [0]))
smile = (smile_l + smile_r) / 2
if smile > 0.05:
    params['smile.intensity'] = round(smile, 4)

The 95th percentile is a deliberate choice: the mean understates a peaking gesture (a celebration's peak is what you remember, not its average frame), and the max is sensitive to extraction noise. The 95th percentile captures "the magnitude this gesture actually reaches" without one bad frame distorting the result.
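A deliberately crude synthetic trace makes the trade-off concrete. The numbers below are invented for illustration (half the frames at rest, half at the gesture's peak, plus one glitch frame), and the `percentile` helper is a standard-library stand-in that uses the same linear-interpolation rule as NumPy's default:

```python
from statistics import fmean

def percentile(values, q):
    """Linear-interpolation percentile (NumPy's default rule)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Invented arm-angle trace in degrees over 100 frames:
# 50 frames near rest, 49 frames at the gesture's real peak,
# and 1 frame where the extractor glitched.
angles = [10.0] * 50 + [80.0] * 49 + [170.0]

print(fmean(angles))           # 45.9  -> understates the peak
print(max(angles))             # 170.0 -> dominated by the glitch frame
print(percentile(angles, 95))  # 80.0  -> the magnitude actually reached
```

With 100 frames, the 95th percentile tolerates a handful of bad frames before the estimate moves, which is why a single glitch has no effect here.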

Two extra passes happen alongside the per-body-part aggregation:

Cycle detection — autocorrelation on the dominant motion series flags cyclic gestures (celebrating bounces; look_at_letter doesn't). Output: gesture_type: cyclic | dynamic | static + cycle_period_frames for cyclic ones.
Pose suppression for static gestures — if a gesture is classified as held (e.g., thinking), the pipeline caps spurious bounce: if gesture_type == 'static' and root_y > 50: root_y = min(root_y, 15). Any bounce above 50 is clamped down to 15, because body sway in a static pose is noise, not intent.
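The cycle-detection pass can be sketched as below. This is one common way to read a period off an autocorrelation curve, written with the standard library only; the function name `detect_cycle_period`, the 0.5 threshold, and the "skip lags until the curve first goes negative" rule are assumptions for the example, not the pipeline's actual settings:

```python
import math

def detect_cycle_period(series, threshold=0.5):
    """Return the dominant period in frames, or None if nothing repeats."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return None  # perfectly flat series: nothing to correlate

    # Normalized autocorrelation for lags 0 .. n/2 - 1 (acf[0] is 1.0).
    acf = [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
        for lag in range(n // 2)
    ]

    # Skip the trivial peak around lag 0: wait for the first negative value.
    first_negative = next((lag for lag, v in enumerate(acf) if v < 0), None)
    if first_negative is None:
        return None

    best_lag = max(range(first_negative, len(acf)), key=acf.__getitem__)
    return best_lag if acf[best_lag] >= threshold else None

# Synthetic stand-ins: a bounce repeating every 13 frames, and a one-way ramp.
bounce = [math.sin(2 * math.pi * i / 13) for i in range(100)]
ramp = [float(i) for i in range(100)]

print(detect_cycle_period(bounce))  # 13
print(detect_cycle_period(ramp))    # None
```

A gesture that returns a period would be tagged cyclic with that cycle_period_frames; how the remaining gestures are split into dynamic and static is not covered by this sketch.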

Phoneme classifier	Gesture pipeline
1 audio recording per phoneme	1 video recording per gesture
Onset / steady / full windows	Cyclic / dynamic / static classification
~300 features per sample	~30 parameters per gesture
Cross-window deltas (onset − steady)	Delta from universal rest pose
RF importance ranks 5 task-winning features	Manifest declares which parameters each animal honors

§8 The schema config — one file drives everything downstream

The output of the whole extraction pipeline is a single JSON. Here's celebrating as it actually ships:

// data/gesture_schema_v3.json
{
  "meta": {
    "version": "3.0",
    "generator": "generate_gesture_schema_v3.py",
    "approach": "delta-from-universal-rest + gesture-classification",
    "rest_source": "data/rest_pose.json"
  },
  "gestures": {
    "celebrating": {
      "_meta": { "source": "measured", "n_frames": 100,
                 "gesture_type": "cyclic", "cycle_period_frames": 13 },
      "gesture_type": "cyclic",
      "arm_left.rot": 82.6,    "arm_right.rot": 84.56,
      "body.rot": 9.34,        "head.rot": 1.63,        "head.tilt": 12.93,
      "root.y": 164.74,       "root.x_sway": 159.99,
      "leg_left.rot": 84.79,  "leg_right.rot": 84.67,
      "smile.intensity": 0.8483, "eye.squint": 0.4203,
      "brow_left.y": 0.047,   "brow_right.y": 0.0415,
      "jaw.open": 0.0367,
      "cheek_left.y": 0.2545, "cheek_right.y": 0.2545,
      "pupil_left.x": -0.32,  "pupil_right.x": -0.32,
      "pupil_left.y": -7.67,  "pupil_right.y": -7.67,
      "tail.rot": 25,         "tuft.rot": 10
    }
  }
}

Every measured parameter is a number that came out of my body. Every animal that gets wired against this schema bounces and waves with magnitudes that match what a cyclic celebration with a 13-frame period looks like on a real human.

§9 What the schema informs downstream

The schema is the contract. Downstream, three things consume it:

Per-animal parameter mapping — configs/parameter_mapping_spec.json binds gesture names to runtime state-machine inputs and declares which body parts each animal honors:

"chick": {
  "runtime_inputs": {
    "isCelebrating": { "type": "bool", "gesture": "celebrating",
                       "description": "Celebration — jump, arms up, biggest gesture" },
    ...
  },
  "gesture_parameter_parts": {
    "celebrating": {
      "input": "isCelebrating",
      "parts": ["root", "body", "arm_left", "arm_right", "leg_left", ...],
      "gesture_defaults_keys": ["root.y", "body.rot", "arm.rot", "leg.rot", ...]
    }
  }
}

Motion curves & engine-ready specs — extract_motion_curves.py turns the schema into temporal motion shapes (smoothed with a Savitzky-Golay filter, then reduced to a few points with Ramer-Douglas-Peucker); auto_gesture_specs.py turns those into engine-ready track specs (frame number, easing, magnitude). The wiring step never sees the raw video.
Engine constants are gone — the engine no longer carries hardcoded magnitudes. It carries pattern templates ("cyclic with period N"); the schema fills in the magnitudes.
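To make the template-plus-magnitude split concrete, here is a hedged sketch of what a "cyclic with period N" template could look like once the schema supplies the numbers. The raised-cosine shape and the function name `cyclic_track` are assumptions for illustration; the engine's real templates and easing are not shown in this write-up:

```python
import math

def cyclic_track(magnitude, period_frames, n_frames):
    """Sample a raised-cosine bounce: 0 at each cycle start, peaking mid-cycle.

    The shape is the template; magnitude and period come from the schema.
    """
    return [
        round(magnitude * 0.5 * (1 - math.cos(2 * math.pi * f / period_frames)), 2)
        for f in range(n_frames)
    ]

# Values read from the celebrating entry of gesture_schema_v3.json.
celebrating = {"root.y": 164.74, "cycle_period_frames": 13}

track = cyclic_track(
    magnitude=celebrating["root.y"],
    period_frames=celebrating["cycle_period_frames"],
    n_frames=26,  # two full cycles
)
print(track[:7])
```

Swapping in another gesture's schema entry changes the output without touching the template, which is the separation the design rule asks for.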

Adding a 6th gesture = recording one new video + adding one line to gestures_config.json + running python gesture_factory.py --add new_gesture new_gesture.mp4 --wire chick. Everything else propagates automatically.
