From hardcoded offsets to data-driven, pose-based gestures
- Pose-extraction pipeline turning my own gesture videos into per-frame body data (133 keypoints + facial blendshapes + iris)
- Universal-rest-pose model so joint deltas are credible — solved the "no clear rest frame in a performing person" problem
- 95th-percentile aggregation: ~100 frames of pose data → ~30-parameter gesture schema
- v3 schema marks every parameter as source: measured instead of source: seed, so the engine no longer has hardcoded keyframe values
- One schema file (gesture_schema_v3.json) that drives all downstream wiring. Adding a new gesture is a recording, not a tuning session.
Take a look at this brief walkthrough: eight beats, click through at your own pace, video to schema in one screen.
And the source data the walkthrough is built around — me doing the encouraging gesture, raw, on a loop:
encouraging.mp4. Phone camera, plain background, multiple takes. This single file produces ~30 measured parameters in gesture_schema_v3.json.
§1 Why I pivoted
The first version of the engine had hand-picked numbers everywhere. arm_left.rot = 10. tail.rot = 8. They felt right when I was tuning one animal one time. They didn't survive the second animal, the third gesture, or the fourth iteration of "the celebration looks too small."
What I wanted was a credible source for these magnitudes. Not numbers I made up — numbers that came from a body actually doing the gesture. So I started recording myself.
Before: v1 hand-tuned
    "idle": {
      "_meta": {"source": "seed"},
      "root.y": 15,
      "arm_left.rot": 10,
      "arm_right.rot": 10,
      "body.rot": 0,
      "head.rot": 0,
      "smile.intensity": 0.2
    }
After: v3 measured from video
    "celebrating": {
      "_meta": {
        "source": "measured",
        "n_frames": 100,
        "gesture_type": "cyclic",
        "cycle_period_frames": 13
      },
      "arm_left.rot": 82.6,
      "arm_right.rot": 84.56,
      "body.rot": 9.34,
      "root.y": 164.74,
      "smile.intensity": 0.8483
    }
The _meta.source field is the contract: every entry is either seed (legacy hand-tuned, kept only for poses I haven't recorded yet) or measured (extracted from video). New gestures must be measured.
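A minimal sketch of how that contract could be checked before the schema is consumed. It assumes each gesture entry carries a _meta.source field as in the snippets above; the names check_sources and allow_seed are hypothetical and not part of the pipeline.

```python
VALID_SOURCES = {"seed", "measured"}


def check_sources(gestures, allow_seed=frozenset()):
    """Return a list of problems; an empty list means the contract holds.

    gestures:   mapping of gesture name -> parameter dict with a "_meta" entry.
    allow_seed: legacy gesture names still permitted to be hand-tuned
                (hypothetical exemption list, assumed for this sketch).
    """
    problems = []
    for name, params in gestures.items():
        source = params.get("_meta", {}).get("source")
        if source not in VALID_SOURCES:
            problems.append(f"{name}: unknown source {source!r}")
        elif source == "seed" and name not in allow_seed:
            problems.append(f"{name}: seed values, but no legacy exemption")
    return problems


gestures = {
    "idle": {"_meta": {"source": "seed"}, "root.y": 15},
    "celebrating": {"_meta": {"source": "measured"}, "root.y": 164.74},
}

print(check_sources(gestures, allow_seed={"idle"}))  # legacy pose explicitly allowed
print(check_sources(gestures))                       # no exemptions: idle is flagged
```

Running a check like this in the generator or in CI would turn "new gestures must be measured" from a convention into a failure that shows up early.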
"defaults in one place · pattern templates in one place · no hardcoded keyframe values in the engine." The engine's job is to interpolate; the schema's job is to say what the body does.
§2 The source: five gesture recordings of me
Each recording is one gesture, multiple takes, phone camera, plain background. The pipeline reads data/gestures_config.json, which maps gesture name to video file:
    {
      "gestures": {
        "ask_child": { "video": "islookingatchild.mp4" },
        "celebrating": { "video": "iscelebrating.mp4" },
        "encouraging": { "video": "isencouragingmp4.mp4" },
        "look_at_letter": { "video": "islookingattarget.mp4" },
        "thinking": { "video": "isthinking.mp4" }
      }
    }
Adding a sixth gesture is a single line in this file plus a new .mp4. No code change.
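As a rough illustration of why no code change is needed, the sketch below treats the config as plain data and checks that every referenced video exists. The helper missing_videos, the "waving" gesture, and the data/videos directory are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path


def missing_videos(config, video_dir):
    """Map gesture name -> video filename for every video not found on disk."""
    root = Path(video_dir)
    return {
        name: entry["video"]
        for name, entry in config["gestures"].items()
        if not (root / entry["video"]).is_file()
    }


config = json.loads("""
{
  "gestures": {
    "celebrating": { "video": "iscelebrating.mp4" },
    "thinking": { "video": "isthinking.mp4" }
  }
}
""")

# Hypothetical new gesture: one new entry in the data, nothing else changes.
config["gestures"]["waving"] = {"video": "iswaving.mp4"}

print(sorted(config["gestures"]))
print(missing_videos(config, "data/videos"))  # assumed video location
```

Because the pipeline iterates over whatever the config lists, the only failure mode for a new gesture is a missing or misnamed file, which a check like this catches before extraction starts.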
§3 The frame dashboard: what the extractor sees
The extractor produces one set of landmarks per frame. The dashboard below shows the skeleton overlaid on the source video, with per-joint traces (arm angle, hip Y, body tilt) plotted next to it. Scrub through frames to see how each parameter moves with the gesture — this is what I look at when I'm trying to find inflection points and verify a recording is usable.
iscelebrating.mp4 with extracted skeleton, arm L/R angles, hip Y, and body tilt. The graphs on the right are how I read motion.
§4 The rest-frame problem
Joint magnitudes only mean something as deltas from rest. arm_left.rot = 82.6° is a celebration-sized arm swing only if I know what "arm down at the side" looks like. So I needed a credible rest pose for this body.
The solution — calmest frames across all videos
I score every frame from every gesture video on three criteria (stillness, arm neutrality, body uprightness), pick the 20 most neutral frames overall, and average them. The rest pose is a property of the person, not of any one gesture, so this cross-gesture pooling is what makes it stable:
    # scripts/rest_pose_model.py:115, compute_cross_gesture_rest()
    for i in range(len(frames)):
        # Motion score (lower = more still)
        m = motion[i]

        # Arm neutrality: arms should hang down, ~0–15° from vertical
        angles = []
        for side in ['LEFT', 'RIGHT']:
            sx, sy = series[f'{side}_SHOULDER']['x'][i], series[f'{side}_SHOULDER']['y'][i]
            wx, wy = series[f'{side}_WRIST']['x'][i], series[f'{side}_WRIST']['y'][i]
            angle = math.degrees(math.atan2(abs(wx - sx), max(wy - sy, 0.001)))
            angles.append(angle)
        arm_score = mean(angles)

        # Body tilt: shoulders should be level
        tilt = math.degrees(math.atan2(dy_shoulders, max(dx_shoulders, 0.001)))

        # Combined: 40% motion stillness, 40% arm neutrality, 20% body uprightness
        combined = 0.4 * (m / 0.5) + 0.4 * (arm_score / 45.0) + 0.2 * (tilt / 10.0)

    # Sort by combined score (lowest first), take the 20 calmest, average → rest_pose
The output is data/rest_pose.json with normalized [0–1] (x, y) for every landmark. Stability check at the end: if the rest frames came from at least 2 different gesture videos, the rest pose is "robust." All current rest poses pass that check.
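The selection and the stability check can be sketched end to end on toy data. This is an illustration under assumptions, not the contents of rest_pose_model.py: the frame records, the helper names, and the use of abs() on tilt (so a left and a right lean score the same) are choices made for this example. The weights and normalizers are the ones quoted above.

```python
import math
from statistics import mean


def arm_angle_from_vertical(shoulder, wrist):
    """Degrees between the shoulder-to-wrist line and straight down (image y grows downward)."""
    sx, sy = shoulder
    wx, wy = wrist
    return math.degrees(math.atan2(abs(wx - sx), max(wy - sy, 0.001)))


def combined_score(motion, arm_angle, tilt):
    """Lower is calmer: 40% stillness, 40% arm neutrality, 20% uprightness."""
    return 0.4 * (motion / 0.5) + 0.4 * (arm_angle / 45.0) + 0.2 * (abs(tilt) / 10.0)


def pick_rest_frames(frames, k=20):
    """Return the k calmest frames, pooled across all gesture videos."""
    score = lambda f: combined_score(f["motion"], f["arm_angle"], f["tilt"])
    return sorted(frames, key=score)[:k]


def is_robust(rest_frames, min_videos=2):
    """Robust if the chosen frames come from at least min_videos different videos."""
    return len({f["video"] for f in rest_frames}) >= min_videos


# Toy per-frame records (values invented for the example).
frames = [
    {"video": "isthinking.mp4", "motion": 0.02, "arm_angle": 4.0, "tilt": 1.0, "hip_y": 0.60},
    {"video": "iscelebrating.mp4", "motion": 0.45, "arm_angle": 80.0, "tilt": 9.0, "hip_y": 0.45},
    {"video": "islookingatchild.mp4", "motion": 0.03, "arm_angle": 6.0, "tilt": 0.5, "hip_y": 0.62},
]

rest = pick_rest_frames(frames, k=2)
rest_hip_y = mean(f["hip_y"] for f in rest)

print([f["video"] for f in rest])
print(rest_hip_y, is_robust(rest))
```

The mid-celebration frame scores far worse on all three terms, so it never reaches the rest pool, and the two frames that do come from different videos, which is what the robustness check looks for.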
What this unlocks
With a credible rest, every gesture's parameters become honest deltas. arm_left.rot = 82.6° in celebrating means the arm rotated 82.6 degrees away from where it sits at rest — a measurement, not a guess.
§5 Pixel space vs Rive space
Video frames live in pixels: a 1080×1920 phone capture, with landmarks reported as normalized [0–1] coordinates. Rive lives in canvas units: every animal's artboard is 1024×1024. These are two different coordinate systems, so the pipeline converts from one to the other before any magnitude leaves the extractor:
    # scripts/generate_gesture_schema.py
    CANVAS_HEIGHT_PX = 1024.0
    CANVAS_WIDTH_PX = 1024.0

    # root.y is a vertical bounce: normalized delta from rest, scaled into Rive canvas space.
    hip_up = rest_hip_y - all_metrics['hip_y']   # normalized [0,1]; image y grows downward
    peak_up = np.percentile(hip_up, 95)
    root_y = max(0, peak_up) * CANVAS_HEIGHT_PX  # scaled to artboard height
    params['root.y'] = round(root_y, 2)
Rotation parameters are unit-free (degrees) and pass through unchanged. Translation parameters (root.y, root.x_sway) get scaled into canvas units. Each manifest declares its artboard dimensions, so the same schema feeds animals of different sizes via per-animal artboard overrides.
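A small sketch of the two scaling steps described above: normalized delta to canvas units, then a per-animal rescale. It assumes an override is a simple proportional rescale against the 1024 reference; the function names and the 512-unit artboard are hypothetical.

```python
REFERENCE_ARTBOARD = 1024.0  # canvas units, matching the 1024×1024 artboard above


def to_canvas_y(normalized_delta, artboard_height=REFERENCE_ARTBOARD):
    """Scale a normalized [0, 1] upward delta into canvas units; negatives clamp to 0."""
    return round(max(0.0, normalized_delta) * artboard_height, 2)


def rescale_translation(value, artboard_size, reference_size=REFERENCE_ARTBOARD):
    """Re-express a stored translation for an animal whose artboard differs in size.

    Assumption for this sketch: the override is a plain proportional rescale.
    """
    return round(value * artboard_size / reference_size, 2)


print(to_canvas_y(0.16))                   # hip rose 16% of frame height
print(to_canvas_y(-0.05))                  # hip below rest: no upward bounce
print(rescale_translation(164.74, 512.0))  # celebrating root.y on a half-size artboard
```

Rotations need none of this: an 82.6° arm swing is 82.6° on any artboard, which is why only root.y and root.x_sway go through the conversion.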
§6 Pose extraction: 133 keypoints per frame
Primary extractor is RTMLib (full body, hands, face — 133 keypoints). MediaPipe's lighter 13-landmark model is the fallback. Per gesture, ~100 frames are extracted at the source video's frame rate.
    # scripts/generate_gesture_schema_v3.py:102
    def extract_pose_frames(video_path, gesture, max_frames=100):
        """Extract pose using rtmlib (133 keypoints) with fallback to MediaPipe."""
        rtmlib_script = SCRIPT_DIR / 'pose_extractor_v2.py'
        if rtmlib_script.exists():
            # 133 keypoints: 17 body + 6 feet + 68 face + 42 hands
            result = subprocess.run(
                [sys.executable, str(rtmlib_script),
                 '--input', str(video_path),
                 '--gesture', gesture,
                 '--max_frames', str(max_frames)],
                capture_output=True)
            if result.returncode == 0:
                return json.load(out)

        # Fallback to MediaPipe (13 body landmarks)
        print(f"  Falling back to MediaPipe (13 landmarks)")
        ...
For face data, MediaPipe Face Mesh runs in parallel on the same frames and produces ~50 blendshape activations (smile / squint / brow / jaw / cheek puff) plus iris-relative-to-eye coordinates for pupil tracking.
§7 From 100 frames to a 30-parameter gesture vector
The phoneme classifier extracts ~300 features from a single audio sample. The gesture pipeline does the analogous thing in the visual domain: ~100 video frames → a single fixed-size parameter vector. The aggregation is 95th-percentile delta from rest pose, per body part:
    # scripts/generate_gesture_schema_v3.py:440, per-gesture aggregation

    # Arm rotation
    for side, series, rest_val in [
        ('left', all_metrics['arm_l'], rest_arm_l),
        ('right', all_metrics['arm_r'], rest_arm_r),
    ]:
        peak = np.percentile(series, 95)   # 100 frames → single peak value
        delta = abs(peak - rest_val)       # delta from rest pose
        params[f'arm_{side}.rot'] = round(delta, 2)

    # Body rotation, head rotation, head tilt: same pattern
    delta_body = abs(np.percentile(all_metrics['body_tilt'], 95) - rest_body_tilt)
    params['body.rot'] = round(delta_body, 2)

    # Face data: averaged across frames (smile / squint / brow / jaw / pupil)
    smile_l = np.mean(blendshapes.get('mouthSmileLeft', [0]))
    smile_r = np.mean(blendshapes.get('mouthSmileRight', [0]))
    smile = (smile_l + smile_r) / 2
    if smile > 0.05:
        params['smile.intensity'] = round(smile, 4)
95th percentile is a deliberate choice: the mean understates a peaking gesture (a celebration's peak is what you remember, not its average frame), and the max is unstable to extraction noise. The 95th percentile captures "the magnitude this gesture actually reaches" without one bad frame distorting the result.
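The trade-off is easy to see on synthetic numbers. The sketch below uses a hand-written percentile with linear interpolation between closest ranks, so that it runs without NumPy. The angle series is invented: 60 frames near rest, 39 frames at the gesture's real peak, and one bad extraction frame.

```python
from statistics import mean


def percentile(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Synthetic arm angles in degrees over 100 frames (invented for the example).
angles = [10.0] * 60 + [80.0] * 39 + [170.0]

print("mean:", mean(angles))            # dragged down by the frames near rest
print("max: ", max(angles))             # equals the single bad frame
print("p95: ", percentile(angles, 95))  # the peak the gesture actually reaches
```

Here the mean lands far below the real peak, the max reports the glitch, and the 95th percentile returns the 80° the arm actually held.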
Two extra passes happen alongside the per-body-part aggregation:
- Cycle detection: autocorrelation on the dominant motion series flags cyclic gestures (celebrating bounces; look_at_letter doesn't). Output: gesture_type: cyclic | dynamic | static, plus cycle_period_frames for cyclic ones.
- Pose suppression for static gestures: if a gesture is classified as held (e.g., thinking), the engine caps spurious bounce: if gesture_type == 'static' and root_y > 50: root_y = min(root_y, 15). Body sway in a static pose is noise, not intent.
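Both passes can be sketched in a few lines. The period detector below is one plausible reading of "autocorrelation on the dominant motion series": it returns the first local peak of the normalized autocorrelation above a threshold. The 0.5 threshold and the function names are assumptions, not values from the pipeline. The suppression rule uses the thresholds quoted in the bullet above.

```python
import math


def autocorr(series, lag):
    """Normalized autocorrelation of the mean-centered series at one lag."""
    mu = sum(series) / len(series)
    centered = [v - mu for v in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0
    num = sum(centered[i] * centered[i + lag] for i in range(len(centered) - lag))
    return num / denom


def detect_period(series, threshold=0.5):
    """Return the first lag where autocorrelation peaks above threshold, else None."""
    if len(series) < 6:
        return None
    max_lag = len(series) // 2
    r = [autocorr(series, lag) for lag in range(max_lag + 1)]
    for lag in range(2, max_lag):
        if r[lag] >= threshold and r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]:
            return lag
    return None


def suppress_static_bounce(gesture_type, root_y, trigger=50, cap=15):
    """Cap implausibly large bounce on a held pose."""
    if gesture_type == "static" and root_y > trigger:
        return min(root_y, cap)
    return root_y


bounce = [math.sin(2 * math.pi * i / 13) for i in range(100)]  # synthetic 13-frame cycle
ramp = [float(i) for i in range(100)]                          # one-way motion, no cycle

print(detect_period(bounce), detect_period(ramp))
print(suppress_static_bounce("static", 120.0), suppress_static_bounce("cyclic", 120.0))
```

Looking for a local peak, instead of the largest value, matters here: a one-way motion such as the ramp has high autocorrelation at small lags but no peak, so it is not mistaken for a cycle.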
| Phoneme classifier | Gesture pipeline |
|---|---|
| 1 audio recording per phoneme | 1 video recording per gesture |
| Onset / steady / full windows | Cyclic / dynamic / static classification |
| ~300 features per sample | ~30 parameters per gesture |
| Cross-window deltas (onset − steady) | Delta from universal rest pose |
| RF importance ranks 5 task-winning features | Manifest declares which parameters each animal honors |
§8 The schema config: one file drives everything downstream
The output of the whole extraction pipeline is a single JSON. Here's celebrating as it actually ships:
    // data/gesture_schema_v3.json
    {
      "meta": {
        "version": "3.0",
        "generator": "generate_gesture_schema_v3.py",
        "approach": "delta-from-universal-rest + gesture-classification",
        "rest_source": "data/rest_pose.json"
      },
      "gestures": {
        "celebrating": {
          "_meta": {
            "source": "measured",
            "n_frames": 100,
            "gesture_type": "cyclic",
            "cycle_period_frames": 13
          },
          "gesture_type": "cyclic",
          "arm_left.rot": 82.6,
          "arm_right.rot": 84.56,
          "body.rot": 9.34,
          "head.rot": 1.63,
          "head.tilt": 12.93,
          "root.y": 164.74,
          "root.x_sway": 159.99,
          "leg_left.rot": 84.79,
          "leg_right.rot": 84.67,
          "smile.intensity": 0.8483,
          "eye.squint": 0.4203,
          "brow_left.y": 0.047,
          "brow_right.y": 0.0415,
          "jaw.open": 0.0367,
          "cheek_left.y": 0.2545,
          "cheek_right.y": 0.2545,
          "pupil_left.x": -0.32,
          "pupil_right.x": -0.32,
          "pupil_left.y": -7.67,
          "pupil_right.y": -7.67,
          "tail.rot": 25,
          "tuft.rot": 10
        }
      }
    }
Every parameter is a number that came out of my body. Every animal that gets wired against this schema bounces and waves with magnitudes that match what a cyclic celebration with a 13-frame period looks like on a real human.
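To make the combination of a measured magnitude and a 13-frame period concrete, here is a minimal sketch of a cyclic bounce built from the two schema values. The raised-cosine shape and the name cyclic_bounce are assumptions for illustration; the engine's actual pattern templates are not shown in this post.

```python
import math


def cyclic_bounce(magnitude, period_frames, n_frames):
    """Vertical offset per frame: rises from 0 to magnitude and back once per period."""
    return [
        round(magnitude * 0.5 * (1 - math.cos(2 * math.pi * frame / period_frames)), 2)
        for frame in range(n_frames)
    ]


# Values taken from the celebrating entry above.
celebrating = {
    "_meta": {"gesture_type": "cyclic", "cycle_period_frames": 13},
    "root.y": 164.74,
}

track = cyclic_bounce(
    celebrating["root.y"],
    celebrating["_meta"]["cycle_period_frames"],
    n_frames=27,  # a little over two cycles
)
print(track[:14])
```

The template contributes only the shape and the repetition; both numbers that decide how big and how fast the bounce is come from the schema.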
§9 What the schema informs downstream
The schema is the contract. Downstream, three things consume it:
- Per-animal parameter mapping: configs/parameter_mapping_spec.json binds gesture names to runtime state-machine inputs and declares which body parts each animal honors:
      "chick": {
        "runtime_inputs": {
          "isCelebrating": {
            "type": "bool",
            "gesture": "celebrating",
            "description": "Celebration — jump, arms up, biggest gesture"
          },
          ...
        },
        "gesture_parameter_parts": {
          "celebrating": {
            "input": "isCelebrating",
            "parts": ["root", "body", "arm_left", "arm_right", "leg_left", ...],
            "gesture_defaults_keys": ["root.y", "body.rot", "arm.rot", "leg.rot", ...]
          }
        }
      }
- Motion curves & engine-ready specs: extract_motion_curves.py turns the schema into temporal motion shapes (savgol-smoothed, RDP-reduced); auto_gesture_specs.py turns those into engine-ready track specs (frame number, easing, magnitude). The wiring step never sees the raw video.
- Engine constants are gone: the engine no longer carries hardcoded magnitudes. It carries pattern templates ("cyclic with period N"); the schema fills in the magnitudes.
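The per-animal mapping in the first bullet amounts to a filter over the schema. A minimal sketch, assuming a parameter's body part is the prefix before the first dot in its key; params_for_animal is a hypothetical helper, and the parts list is truncated to the entries visible in the config excerpt.

```python
def params_for_animal(gesture_params, honored_parts):
    """Keep only the parameters whose body part the animal declares it honors."""
    honored = set(honored_parts)
    return {
        key: value
        for key, value in gesture_params.items()
        if not key.startswith("_") and key.split(".", 1)[0] in honored
    }


# Subset of the celebrating entry from gesture_schema_v3.json.
celebrating = {
    "_meta": {"source": "measured"},
    "arm_left.rot": 82.6,
    "body.rot": 9.34,
    "root.y": 164.74,
    "root.x_sway": 159.99,
    "tail.rot": 25,
}

# Parts visible in the chick excerpt; the real list continues past "leg_left".
chick_parts = ["root", "body", "arm_left", "arm_right", "leg_left"]

print(params_for_animal(celebrating, chick_parts))
```

With these inputs tail.rot drops out because tail is not among the listed parts, while both root parameters survive. The same schema can therefore serve animals with different rigs without per-animal copies of the numbers.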
Adding a 6th gesture = recording one new video + adding one line to gestures_config.json + running python gesture_factory.py --add new_gesture new_gesture.mp4 --wire chick. Everything else propagates automatically.