Problem: Speech recognition for 4-year-olds, running on-device on a phone, without sending children's audio to any cloud.
Why it matters: Cloud calls mean COPPA compliance overhead, added latency, and ongoing inference cost. The constraint: fit accurate phoneme recognition into tens of KB and run it on a low-end Android phone in under 100ms.
What I built:
- Two-track classifier — Track A (MFCC + cosine, 15.4KB asset), Track B (WavLM fine-tune, ONNX export)
- VTLN factor 1.104 derived from child F0 mean (269 Hz) — adult-trained features warped to child vocal tract
- Sander 1972 substitution table for age-gated developmental allowances (producing /t/ for /k/ at age 2–3 is development, not error)
- 4-gate noise rejector for bedroom-recording realism
- False-rejection target <5%: better to accept a wrong sound than reject a correct one
Result: 97.0% accuracy on the full 66-phoneme test set; 100% on Diane's voice baseline. Shipping in production in the L2R Android app.
Stack: Python (librosa, scikit-learn, ONNX, PyTorch for Track B), Kotlin (custom PhonemeClassifier on Android)
Learn to Read · ML / Audio · On-Device
97% Phonics Accuracy, No Cloud
On-device speech recognition for 4-year-olds. 66 phonemes. 15.4KB shipped asset. Zero cloud calls. Grounded in published phonetics research.
97% accuracy (66 phonemes)
15.4KB shipped Android asset
Most "ML in production" stories scale up — bigger models, more data, more compute. This one scales down: how small can the model be while still working for the actual user — a 3-year-old whose voice doesn't match any adult speech corpus?
Grounded in phonetics — hierarchical manner classifier
Building a flat 24-way consonant classifier is wrong. Different manner classes have fundamentally different acoustic signatures — you can't use the same features for a plosive burst (5–20ms transient) and a fricative (50–200ms sustained noise). The classifier is hierarchical: manner first, then within-class discrimination.
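The two-stage structure can be sketched in a few lines of Python. This is a minimal illustration, not the shipped Kotlin PhonemeClassifier: the function name `classify_hierarchical`, the callable-based interface, and the feature dictionary are all hypothetical. Only the manner classes and their members come from this page.

```python
from typing import Callable, Dict, List

# Manner classes and their member phonemes, as listed in this section.
MANNER_CLASSES: Dict[str, List[str]] = {
    "plosive": ["p", "b", "t", "d", "k", "g"],
    "fricative": ["f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h"],
    "nasal": ["m", "n", "ŋ"],
    "approximant": ["l", "r", "w", "j"],
    "affricate": ["tʃ", "dʒ"],
}

Features = Dict[str, float]  # hypothetical feature container


def classify_hierarchical(
    features: Features,
    manner_classifier: Callable[[Features], str],
    within_class: Dict[str, Callable[[Features], str]],
) -> str:
    """Stage 1 picks the manner class; stage 2 runs that class's own discriminator."""
    manner = manner_classifier(features)
    phoneme = within_class[manner](features)
    if phoneme not in MANNER_CLASSES[manner]:
        raise ValueError(f"{phoneme!r} is not a member of class {manner!r}")
    return phoneme


# The five classes together cover the 24 consonants of the flat problem.
N_CONSONANTS = sum(len(members) for members in MANNER_CLASSES.values())
```

Each within-class discriminator only ever sees sounds of one manner, so it can use the feature that suits that class (VOT for plosives, F3 for /r/ vs /l/, and so on) instead of one shared feature set.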
Plosives (stops)
/p b t d k g/
Complete closure + sudden burst release. Key feature: voice onset time (VOT), about 11ms for voiced /b/ and about 58ms for voiceless /p/. The burst's spectral peak indicates place of articulation. Stevens (1998) §7.
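As a rough sketch of the voicing decision, VOT is the gap between the burst and the onset of voicing. The sample indices are assumed to come from a separate burst and voicing detector (not shown), and the midpoint boundary is an assumption made here for illustration, not a value from the app.

```python
VOICED_VOT_MS = 11.0      # typical /b/ value quoted above
VOICELESS_VOT_MS = 58.0   # typical /p/ value quoted above

# Assumption: place the decision boundary halfway between the two reference values.
VOT_BOUNDARY_MS = (VOICED_VOT_MS + VOICELESS_VOT_MS) / 2


def vot_ms(burst_sample: int, voicing_onset_sample: int, sr: int) -> float:
    """Voice onset time: delay from burst release to voicing onset, in milliseconds."""
    return (voicing_onset_sample - burst_sample) * 1000.0 / sr


def is_voiceless(vot: float) -> bool:
    """Long-lag VOT suggests a voiceless plosive, short-lag a voiced one."""
    return vot > VOT_BOUNDARY_MS
```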
Fricatives
/f v θ ð s z ʃ ʒ h/
Sustained turbulent noise. Sibilants (/s z ʃ ʒ/) have 15–25dB more energy than non-sibilants (/f v θ ð/), the most reliable single split. Spectral centroid determines place. Jongman et al. (2000).
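Both fricative cues can be computed from a single frame with NumPy. The sketch below is illustrative only: the function names are hypothetical, the 15dB margin is the low end of the range quoted above, and the non-sibilant reference level is assumed to have been calibrated elsewhere.

```python
import numpy as np


def frame_level_db(frame: np.ndarray) -> float:
    """Mean power of the frame in dB relative to full scale (amplitude 1.0)."""
    power = float(np.mean(np.square(frame)))
    return float(10.0 * np.log10(power + 1e-12))


def spectral_centroid_hz(frame: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency of the frame."""
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-12))


def is_sibilant(frication_db: float, non_sibilant_ref_db: float,
                margin_db: float = 15.0) -> bool:
    """Sibilant if the frication is at least margin_db above the reference level."""
    return frication_db - non_sibilant_ref_db >= margin_db
```

Once a sound is on the sibilant side of the split, the centroid can separate places of articulation: a higher centroid points toward /s z/, a lower one toward /ʃ ʒ/.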
Nasals
/m n ŋ/
Velum lowers, air through nasal cavity. Strong ~250Hz murmur + anti-formants. F2 transition locus distinguishes place: /m/ ~1kHz, /n/ ~1.7kHz, /ŋ/ ~2kHz+.
Approximants
/l r w j/
Formant structure (vowel-like) with constriction. Critical cue for /r/ vs /l/: F3 — /r/ = 1300–1700Hz, /l/ = 2200Hz+. Espy-Wilson (1992).
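The F3 cue reduces to a range check once F3 has been measured. A minimal sketch, assuming F3 is already expressed in the same frequency space as the ranges quoted above; the function name and the decision to return no answer between the ranges are choices made here, not details from the app.

```python
from typing import Optional

R_F3_RANGE_HZ = (1300.0, 1700.0)  # /r/ range quoted above
L_F3_MIN_HZ = 2200.0              # /l/ lower bound quoted above


def classify_liquid(f3_hz: float) -> Optional[str]:
    """Return "r" or "l" from F3 alone, or None when F3 falls outside both ranges."""
    if R_F3_RANGE_HZ[0] <= f3_hz <= R_F3_RANGE_HZ[1]:
        return "r"
    if f3_hz >= L_F3_MIN_HZ:
        return "l"
    return None  # ambiguous: leave the decision to other cues
```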
Affricates
/tʃ dʒ/
Stop + fricative in sequence. Rise time <30ms = affricate; >50ms = fricative. Distinguishes /tʃ/ from /ʃ/. Howell & Rosen (1983).
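A sketch of the rise-time test, assuming an amplitude envelope has already been extracted at the audio sample rate. Measuring the rise between 10% and 90% of the peak is a convention chosen here for illustration; the thresholds are the ones quoted above, and the names are hypothetical.

```python
import numpy as np


def rise_time_ms(envelope: np.ndarray, sr: int,
                 lo: float = 0.1, hi: float = 0.9) -> float:
    """Time for the envelope to climb from lo*peak to hi*peak, in milliseconds."""
    peak = float(np.max(envelope))
    start = int(np.argmax(envelope >= lo * peak))  # first sample at or above lo*peak
    end = int(np.argmax(envelope >= hi * peak))    # first sample at or above hi*peak
    return (end - start) * 1000.0 / sr


def classify_onset(rise_ms: float) -> str:
    """Apply the <30ms / >50ms rule; anything in between stays undecided."""
    if rise_ms < 30.0:
        return "affricate"
    if rise_ms > 50.0:
        return "fricative"
    return "ambiguous"
```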
Vowels
/æ ɛ ɪ ɑ ʌ/ + diphthongs
F1 (height), F2 (front/back). Short vowels for Phase 1 (CVC words). VTLN warp compensates for the child's shorter vocal tract before formant comparison. Lee et al. (1999).
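One way to picture the warp: scale adult reference formants up by the VTLN factor, then compare the child's measured F1 and F2 against the warped values. The sketch below assumes a simple linear warp and a Euclidean distance in Hz, and the adult formant values are illustrative placeholders, not the app's templates. Only the factor 1.104 and the vowel set come from this page.

```python
import math
from typing import Dict, Tuple

VTLN_FACTOR = 1.104  # warp factor quoted in this write-up

# Illustrative adult (F1, F2) values in Hz; placeholders, not the app's templates.
ADULT_FORMANTS_HZ: Dict[str, Tuple[float, float]] = {
    "æ": (660.0, 1720.0),
    "ɛ": (530.0, 1840.0),
    "ɪ": (390.0, 1990.0),
    "ɑ": (730.0, 1090.0),
    "ʌ": (640.0, 1190.0),
}


def warp_to_child(formants_hz: Tuple[float, float],
                  factor: float = VTLN_FACTOR) -> Tuple[float, float]:
    """Linear VTLN (an assumption): scale each adult formant up by the warp factor."""
    f1, f2 = formants_hz
    return (f1 * factor, f2 * factor)


def nearest_vowel(child_f1_hz: float, child_f2_hz: float) -> str:
    """Pick the vowel whose warped adult template is closest in (F1, F2)."""
    def distance(vowel: str) -> float:
        f1, f2 = warp_to_child(ADULT_FORMANTS_HZ[vowel])
        return math.hypot(child_f1_hz - f1, child_f2_hz - f2)
    return min(ADULT_FORMANTS_HZ, key=distance)
```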
Why child speech is different — developmental acquisition
A 3-year-old who says "wabbit" for "rabbit" is developmentally normal. The classifier must know the child's age and accept developmentally expected substitutions; marking them wrong would be pedagogically harmful. Data from Sander (1972), McLeod & Crowe (2018).
mastered by age 2–3
Early sounds
/m b p n d t w h/ — bilabial and alveolar, visible lip/tongue placement, earliest acquired
mastered by age 3–4
Middle sounds
/k g f ŋ j/ — velar stops, labiodental fricative. Back-of-mouth sounds develop as articulator control improves
mastered by age 4–6
Late sounds
/l r s z ʃ tʃ dʒ v/ — require precise tongue groove (/s/), sustained labiodental contact (/v/), tongue-tip control (/l/)
mastered by age 6–7+
Latest sounds
/θ ð ʒ/ — dental fricatives, rare phonemes. /r/ production variable through age 8 in some children
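The age gate can be sketched as a lookup over the four tiers above. The mastery ages below take the upper bound of each tier, and the substitution set holds only the two pairs mentioned on this page (/w/ produced for /r/, /t/ produced for /k/). The real table is larger and is not reproduced here, and the names `MASTERY_AGE`, `ALLOWED_SUBSTITUTIONS` and `is_accepted` are hypothetical.

```python
from typing import Dict, Set, Tuple

# Upper bound, in years, of each "mastered by" tier listed above.
MASTERY_AGE: Dict[str, int] = {}
for age, phonemes in [
    (3, "m b p n d t w h"),
    (4, "k g f ŋ j"),
    (6, "l r s z ʃ tʃ dʒ v"),
    (7, "θ ð ʒ"),
]:
    for phoneme in phonemes.split():
        MASTERY_AGE[phoneme] = age

# (target, produced) pairs. Illustrative subset only: the two cases named on this page.
ALLOWED_SUBSTITUTIONS: Set[Tuple[str, str]] = {("r", "w"), ("k", "t")}


def is_accepted(target: str, produced: str, age_years: int) -> bool:
    """Accept a correct sound, or a listed substitution while the target is still developing."""
    if produced == target:
        return True
    still_developing = age_years < MASTERY_AGE.get(target, 0)
    return still_developing and (target, produced) in ALLOWED_SUBSTITUTIONS
```

With this table, "wabbit" passes for a 3-year-old but not for a 7-year-old, and a substitution that is not listed (such as /l/ for /r/) is rejected at any age.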
Hear it yourself — audio QA tools
The classifier is only as good as the audio it's trained against. These are the ear-check tools I use to verify every one of the 66 sounds shipping in the L2R Android app.