Problem: On-device speech recognition for 4-year-olds, running on a phone, without sending children's audio to any cloud.
Why it matters: Cloud calls mean COPPA compliance overhead, added latency, and ongoing inference cost. The constraint: fit accurate phoneme recognition into tens of KB and run it on a low-end Android phone in under 100 ms.
What I built:
Result: 97.0% accuracy on the full 66-phoneme test set. 100% on Diane's voice baseline. Shipping in production in the L2R Android app.
Stack: Python (librosa, scikit-learn, ONNX, PyTorch for Track B), Kotlin (custom PhonemeClassifier on Android)
Learn to Read · ML / Audio · On-Device

97% Phonics Accuracy, No Cloud

On-device speech recognition for 4-year-olds. 66 phonemes. 15.4KB shipped asset. Zero cloud calls. Grounded in published phonetics research.
97% accuracy (66 phonemes) · 15.4KB shipped Android asset · 0 cloud calls
Most "ML in production" stories scale up — bigger models, more data, more compute. This one scales down: how small can the model be while still working for the actual user — a 3-year-old whose voice doesn't match any adult speech corpus?
▶ See it running
Watch the classifier work on my kid in the Learn to Read Android app. 15.4 KB. 48 ms latency. Zero cloud calls.
/æ/ vowel
/f/ fricative
Feature engineering
How I extracted ~300 features from my own sound recordings, and which 5 won each within-manner task.
Read the deep-dive →
[Diagram: audio → Track B (WavLM, 97.4%) → manner classes: plosive, fricative, nasal, vowel → per-class models: RF, ET, GB, ET]
Model building
The classifier evolution: cosine similarity → WavLM fine-tune → +7.9% from one domain insight.
Read the deep-dive →
Grounded in phonetics — hierarchical manner classifier

Building a flat 24-way consonant classifier is wrong. Different manner classes have fundamentally different acoustic signatures — you can't use the same features for a plosive burst (5–20ms transient) and a fricative (50–200ms sustained noise). The classifier is hierarchical: manner first, then within-class discrimination.
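
A minimal sketch of the manner-first routing described above, in Python. The function name, the dictionary-of-specialists layout, the feature names, and the toy stand-in models are all assumptions made for illustration; this is not the shipped PhonemeClassifier.

```python
from typing import Callable, Dict, Mapping, Tuple

Features = Mapping[str, float]      # named acoustic features for one sound
Model = Callable[[Features], str]   # anything that maps features to a label


def classify_hierarchical(
    features: Features,
    manner_model: Model,
    specialists: Dict[str, Model],
) -> Tuple[str, str]:
    """Stage 1 picks the manner class; stage 2 runs only that class's model."""
    manner = manner_model(features)
    if manner not in specialists:
        raise KeyError(f"no within-manner model registered for {manner!r}")
    return manner, specialists[manner](features)


# Toy stand-ins (hypothetical). Real stages would be trained classifiers.
def toy_manner_model(f: Features) -> str:
    # Short transient -> plosive, long sustained noise -> fricative.
    return "plosive" if f["noise_duration_ms"] <= 20 else "fricative"


TOY_SPECIALISTS: Dict[str, Model] = {
    "plosive": lambda f: "b" if f["voiced"] else "p",
    "fricative": lambda f: "s" if f["is_sibilant"] else "f",
}
```

Each specialist only reads the features that matter for its own manner class, which is the point of splitting the problem: the plosive model never has to look at sibilance, and the fricative model never has to look at voicing onset.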

Plosives (stops)
/p b t d k g/
Complete closure + sudden burst release. Key feature: VOT (ms) — voiced /b/ = ~11ms, voiceless /p/ = ~58ms. Burst spectral peak indicates place. Stevens (1998) §7.
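
The VOT cue can be sketched as a single threshold. This assumes the burst and voicing-onset times have already been located by some earlier step, and it places the boundary halfway between the two figures quoted above; that midpoint is an illustrative choice, not a tuned value from the app.

```python
VOICED_VOT_MS = 11.0      # /b/ figure quoted above
VOICELESS_VOT_MS = 58.0   # /p/ figure quoted above

# Assumption: put the decision boundary halfway between the two quoted values.
VOT_BOUNDARY_MS = (VOICED_VOT_MS + VOICELESS_VOT_MS) / 2


def voice_onset_time_ms(burst_time_s: float, voicing_onset_s: float) -> float:
    """VOT is the delay from burst release to the start of voicing."""
    return (voicing_onset_s - burst_time_s) * 1000.0


def is_voiced_stop(vot_ms: float, boundary_ms: float = VOT_BOUNDARY_MS) -> bool:
    """Short lag means voiced (/b d g/); long lag means voiceless (/p t k/)."""
    return vot_ms < boundary_ms
```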
Fricatives
/f v θ ð s z ʃ ʒ h/
Sustained turbulent noise. Sibilants (/s z ʃ ʒ/) have 15–25dB more energy than non-sibilants (/f v θ ð/), which makes this the most reliable single split. Spectral centroid determines place. Jongman et al. (2000).
Nasals
/m n ŋ/
Velum lowers, air through nasal cavity. Strong ~250Hz murmur + anti-formants. F2 transition locus distinguishes place: /m/ ~1kHz, /n/ ~1.7kHz, /ŋ/ ~2kHz+.
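
As a rough illustration of the F2-locus cue, the sketch below picks whichever quoted locus is closest to a measured value. It assumes the measurement is already on the same frequency scale as the quoted figures, and it treats "~2kHz+" as a single 2000 Hz anchor, which is a simplification.

```python
# Approximate F2 transition loci quoted above, in Hz.
NASAL_F2_LOCUS_HZ = {"m": 1000.0, "n": 1700.0, "ŋ": 2000.0}


def nasal_place_from_f2(f2_locus_hz: float) -> str:
    """Return the nasal whose quoted F2 locus is nearest the measured one."""
    return min(
        NASAL_F2_LOCUS_HZ,
        key=lambda phoneme: abs(NASAL_F2_LOCUS_HZ[phoneme] - f2_locus_hz),
    )
```

With these anchors the implied boundaries fall at 1350 Hz and 1850 Hz, so the /n/ versus /ŋ/ decision has far less margin than /m/ versus /n/.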
Approximants
/l r w j/
Formant structure (vowel-like) with constriction. Critical cue for /r/ vs /l/: F3 — /r/ = 1300–1700Hz, /l/ = 2200Hz+. Espy-Wilson (1992).
Affricates
/tʃ dʒ/
Stop + fricative in sequence. Rise time <30ms = affricate; >50ms = fricative. Distinguishes /tʃ/ from /ʃ/. Howell & Rosen (1983).
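
One way to turn the rise-time cue into code is sketched below. The 10%-to-90%-of-peak definition of rise time and the frame-based envelope input are assumptions; the source only gives the 30 ms and 50 ms thresholds. Values between the two thresholds are reported as ambiguous rather than forced into a class.

```python
from typing import Sequence


def rise_time_ms(envelope: Sequence[float], frame_ms: float) -> float:
    """Time for the amplitude envelope to climb from 10% to 90% of its peak."""
    peak = max(envelope)
    if peak <= 0:
        raise ValueError("envelope has no energy")
    start = next(i for i, v in enumerate(envelope) if v >= 0.1 * peak)
    end = next(i for i, v in enumerate(envelope) if v >= 0.9 * peak)
    return (end - start) * frame_ms


def affricate_or_fricative(rise_ms: float) -> str:
    """Apply the thresholds quoted above: <30 ms affricate, >50 ms fricative."""
    if rise_ms < 30.0:
        return "affricate"
    if rise_ms > 50.0:
        return "fricative"
    return "ambiguous"
```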
Vowels
/æ ɛ ɪ ɑ ʌ/ + diphthongs
F1 (height), F2 (front/back). Short vowels for Phase 1 (CVC words). VTLN warp normalizes the child's formant frequencies, which a shorter vocal tract shifts upward, before formant comparison. Lee et al. (1999).
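
To make the warp concrete, here is a hedged sketch of a linear VTLN step followed by a nearest-vowel lookup in (F1, F2) space. The adult reference formants are rough textbook-style values included only for illustration, the warp factor of 0.8 in the note below is hypothetical, and plain Euclidean distance in Hz is a simplification.

```python
import math
from typing import Dict, Tuple

# Rough adult (F1, F2) targets in Hz. Illustrative assumptions, not app data.
ADULT_VOWEL_FORMANTS_HZ: Dict[str, Tuple[float, float]] = {
    "æ": (660.0, 1720.0),
    "ɛ": (530.0, 1840.0),
    "ɪ": (390.0, 1990.0),
    "ɑ": (730.0, 1090.0),
    "ʌ": (640.0, 1190.0),
}


def warp_formants(f1_hz: float, f2_hz: float, alpha: float) -> Tuple[float, float]:
    """Linear VTLN: scale the frequency axis by alpha (alpha < 1 lowers formants)."""
    return f1_hz * alpha, f2_hz * alpha


def nearest_vowel(f1_hz: float, f2_hz: float, alpha: float = 1.0) -> str:
    """Warp the measured formants, then pick the closest adult reference vowel."""
    w1, w2 = warp_formants(f1_hz, f2_hz, alpha)
    return min(
        ADULT_VOWEL_FORMANTS_HZ,
        key=lambda v: math.hypot(
            w1 - ADULT_VOWEL_FORMANTS_HZ[v][0],
            w2 - ADULT_VOWEL_FORMANTS_HZ[v][1],
        ),
    )
```

With these reference values, a token measured at (912.5, 1362.5) Hz lands nearest /ʌ/ when compared unwarped, but nearest /ɑ/ after `nearest_vowel(912.5, 1362.5, alpha=0.8)`. That is the kind of confusion the warp is meant to prevent.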
Why child speech is different — developmental acquisition

A 3-year-old who says "wabbit" for "rabbit" is developmentally normal. The classifier must know the child's age and accept developmentally-expected substitutions — marking them wrong would be pedagogically harmful. Data from Sander (1972), McLeod & Crowe (2018).

mastered by age 2–3
Early sounds
/m b p n d t w h/ — bilabial and alveolar, visible lip/tongue placement, earliest acquired
mastered by age 3–4
Middle sounds
/k g f ŋ j/ — velar stops, labiodental fricative. Back-of-mouth sounds develop as articulator control improves
mastered by age 4–6
Late sounds
/l r s z ʃ tʃ dʒ v/ — require precise tongue groove (/s/), sustained labiodental contact (/v/), tongue-tip control (/l/)
mastered by age 6–7+
Latest sounds
/θ ð ʒ/ — dental fricatives, rare phonemes. /r/ production variable through age 8 in some children
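
The age-aware scoring rule can be sketched as follows. The mastery ages use the upper end of each range listed above. The substitution table is hypothetical and deliberately tiny: only the /r/ → /w/ pair ("wabbit") comes from the text, and a real table would need to be built from the cited acquisition data.

```python
from typing import Dict, Set, Tuple

# Upper end of each "mastered by" range listed above, in years.
MASTERY_AGE_YEARS: Dict[str, int] = {}
for age, phonemes in [
    (3, "m b p n d t w h"),
    (4, "k g f ŋ j"),
    (6, "l r s z ʃ tʃ dʒ v"),
    (7, "θ ð ʒ"),
]:
    for phoneme in phonemes.split():
        MASTERY_AGE_YEARS[phoneme] = age

# Hypothetical table of (target, heard) pairs treated as developmentally expected.
EXPECTED_SUBSTITUTIONS: Set[Tuple[str, str]] = {("r", "w")}


def score_attempt(target: str, heard: str, child_age_years: float) -> str:
    """Return 'correct', 'developmental' (accepted), or 'incorrect'."""
    if heard == target:
        return "correct"
    # Phonemes missing from the table (e.g. vowels) are treated as already expected.
    not_yet_expected = child_age_years < MASTERY_AGE_YEARS.get(target, 0)
    if not_yet_expected and (target, heard) in EXPECTED_SUBSTITUTIONS:
        return "developmental"
    return "incorrect"
```

Keeping "developmental" as its own outcome, rather than folding it into "correct", lets the app avoid marking the child wrong while still tracking which sounds are not yet in place.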
Hear it yourself — audio QA tools

The classifier is only as good as the audio it's trained against. These are the ear-check tools I use to verify every one of the 66 sounds shipping in the L2R Android app.

👶 Child voice samples — actual ground-truth recordings: my kids at age 2 and 4, Diane-verified. Grounds the VTLN factor and 97.4% manner accuracy.
🔊 Synthetic-TTS QA: ear-check the GCP Neural2-generated audio that was evaluated as a baseline.