PROBLEM: Speech recognition for 4-year-olds. On-device. On a phone. Without sending audio of children to any cloud.
WHY IT MATTERS: Cloud calls would mean COPPA compliance overhead, latency, network dependency, and an ongoing inference cost per child. Most speech models range from megabytes to gigabytes. The problem is fitting accurate phoneme recognition into tens of kilobytes and running it on a low-end Android phone in under 100 ms.
STACK: Python (librosa, scikit-learn, ONNX, PyTorch for Track B), Kotlin (custom PhonemeClassifier on Android with MFCC extractor + cosine matcher)

Phoneme Classifier — On-Device Speech Recognition

66 phonemes · 97% accuracy · 15.4KB shipped asset · zero cloud calls
Most "ML in production" stories scale up — bigger models, more data, more compute. This one scales down: how small can the model be while still working for the actual user (a kid whose voice doesn't match any adult speech corpus)?

The two-track approach

Two parallel classifier tracks were developed and evaluated. Track A is the production-shipping baseline; Track B is the research direction for harder phoneme distinctions.

Track A — Classical (production)
Track B — Neural (research)

Confusion matrix — 64/66 correct (97.0%)

Test set: 66 phonemes (26 single letters + 40 letter combos / digraphs / trigraphs) · Accuracy: 97.0% (64 of 66 correct) · Diane's voice baseline: 100%
Only 2 confusions in the entire matrix, and both are phonetically identical pairs that no human grader would mark wrong:
True label | Predicted as | Notes
ee | ea | "ee" sometimes classified as "ea"
ea | ee | "ea" sometimes classified as "ee"
tch | ch | "tch" sometimes classified as "ch"
ch | tch | "ch" sometimes classified as "tch"
(other 62) | (same label) | Perfect diagonal: 62 / 62 correct
Both confusion pairs are phonetically identical in English: /iː/ for ee/ea, /tʃ/ for tch/ch. A human grader would not mark these wrong. The classifier disambiguates by spelling context in the L2R app, not by acoustic difference.
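The arithmetic behind the headline number can be sketched as follows. This is a minimal illustration, not the real evaluation script: the placeholder labels `p0`..`p63` and the choice of which two items failed are assumptions made so the example runs, and `HOMOPHONE_GROUPS`, `canonical`, and `accuracy` are hypothetical names. Strict scoring counts a homophone mix-up as an error; lenient scoring merges spellings that share a sound, which is roughly what a human grader would do.

```python
# Spellings that share one sound, taken from the two pairs described above.
HOMOPHONE_GROUPS = [{"ee", "ea"}, {"tch", "ch"}]


def canonical(label):
    """Map a spelling to a stable representative of its homophone group."""
    for group in HOMOPHONE_GROUPS:
        if label in group:
            return min(group)
    return label


def accuracy(pairs, merge_homophones=False):
    """Fraction of (true, predicted) pairs that match."""
    key = canonical if merge_homophones else (lambda label: label)
    correct = sum(1 for true, pred in pairs if key(true) == key(pred))
    return correct / len(pairs)


# Illustrative results only: 64 correct items plus two homophone mix-ups.
results = [(f"p{i}", f"p{i}") for i in range(64)]
results += [("ee", "ea"), ("tch", "ch")]

strict = accuracy(results)
lenient = accuracy(results, merge_homophones=True)
print(f"strict:  {strict:.1%}")
print(f"lenient: {lenient:.1%}")
```

Reporting the strict figure keeps the number honest, while the lenient figure shows how much of the gap is purely orthographic.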

Why this works on a 4-year-old's voice

Adult speech corpora (LibriSpeech, Common Voice) don't represent child voices. Children have higher fundamental frequency, shorter vocal tracts, and different formant ratios. Models trained only on adult data systematically fail on kids.
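One common way to compensate for a shorter vocal tract is a piecewise-linear frequency warp, the usual form of VTLN. The sketch below shows the general technique, not necessarily the warp used in this project. The only value taken from this page is the factor 1.104; the 8 kHz Nyquist frequency, the knee position, the warp direction (dividing by the factor to pull child formants down), and the name `vtln_warp` are all assumptions.

```python
def vtln_warp(freq_hz, alpha=1.104, nyquist_hz=8000.0, knee_ratio=0.85):
    """Piecewise-linear VTLN warp of a single frequency in Hz.

    Below the knee, frequencies are divided by alpha, which compresses a
    child's higher formants toward adult positions. Above the knee, a
    second linear segment maps the Nyquist frequency to itself so the
    warped axis still spans 0..nyquist_hz.
    """
    if not 0.0 <= freq_hz <= nyquist_hz:
        raise ValueError("frequency outside 0..nyquist_hz")
    knee_hz = knee_ratio * nyquist_hz  # assumed knee position
    if freq_hz <= knee_hz:
        return freq_hz / alpha
    warped_knee = knee_hz / alpha
    slope = (nyquist_hz - warped_knee) / (nyquist_hz - knee_hz)
    return warped_knee + slope * (freq_hz - knee_hz)


for f in (500.0, 1104.0, 3000.0, 8000.0):
    print(f"{f:7.1f} Hz -> {vtln_warp(f):7.1f} Hz")
```

In a feature pipeline this function would typically be applied to the centre frequencies of the mel filterbank before MFCC extraction, so the warp costs nothing at inference time.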

Adaptation | What it does | Impact
VTLN (Vocal Tract Length Normalization) | Frequency-warps adult-trained features to match child vocal tract. Factor 1.104 derived from child F0 mean of 269 Hz. | Brings child speech into adult feature space.
Sander 1972 substitution table | Age-gated phoneme substitution: accepts /w/ for /r/ at age 3, but not at age 6. | Avoids penalizing developmentally-typical pronunciations.
4-gate noise rejector | VAD + SNR + spectral flatness + duration gate before classification. | Rejects bedroom noise, sibling speech, breath noise without false positives on quiet voices.
Recording-on-device | Audio never leaves the phone. | COPPA-compliant by architecture: there's no parental consent dance for cloud transmission because there's no transmission.
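Two of the adaptations in the table are simple enough to sketch together: the four gates run before classification, and the age-gated substitution check runs after it. This is a minimal sketch under loud assumptions. Every threshold, the `Utterance` fields, and the age cutoff of 5 for /w/-for-/r/ are placeholders chosen so the example runs; the page only states that the substitution is accepted at age 3 and rejected at age 6. None of the names below come from the shipped code.

```python
import math
from dataclasses import dataclass


def spectral_flatness(power_spectrum):
    """Geometric mean / arithmetic mean of a power spectrum, in 0..1.

    Values near 1 look like noise; values near 0 look tonal or voiced.
    """
    eps = 1e-12  # avoids log(0) on empty bins
    n = len(power_spectrum)
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    arithmetic_mean = sum(power_spectrum) / n + eps
    return math.exp(log_mean) / arithmetic_mean


@dataclass
class Utterance:
    voiced_ratio: float  # share of frames a VAD marked as speech
    snr_db: float        # estimated signal-to-noise ratio
    flatness: float      # spectral flatness, see above
    duration_ms: float


# Hypothetical thresholds, not the values used in the app.
MIN_VOICED_RATIO = 0.3
MIN_SNR_DB = 6.0
MAX_FLATNESS = 0.5
DURATION_RANGE_MS = (80.0, 1500.0)


def first_failed_gate(u):
    """Return the name of the first gate that rejects u, or None."""
    if u.voiced_ratio < MIN_VOICED_RATIO:
        return "vad"
    if u.snr_db < MIN_SNR_DB:
        return "snr"
    if u.flatness > MAX_FLATNESS:
        return "flatness"
    if not DURATION_RANGE_MS[0] <= u.duration_ms <= DURATION_RANGE_MS[1]:
        return "duration"
    return None


# (target, produced) -> oldest age in years at which the substitution is
# still accepted. The cutoff 5 is a placeholder, not a value from Sander 1972.
ACCEPT_UNTIL_AGE = {("r", "w"): 5}


def accepts(target, produced, age_years):
    """True if the produced phoneme counts as correct for this age."""
    if produced == target:
        return True
    cutoff = ACCEPT_UNTIL_AGE.get((target, produced))
    return cutoff is not None and age_years <= cutoff


quiet_child = Utterance(voiced_ratio=0.6, snr_db=9.0, flatness=0.2, duration_ms=400.0)
print(first_failed_gate(quiet_child), accepts("r", "w", 3), accepts("r", "w", 6))
```

Returning the name of the failed gate, rather than a bare boolean, makes it easier to see which gate is responsible when a quiet voice is wrongly rejected.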

Stack — Python research → Kotlin production

Python research
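The core of a cosine matcher over reference profiles fits in a few lines of standard-library Python. This is a sketch of the general technique named in the stack, not the project's research code: the four-dimensional toy vectors stand in for MFCC-derived profiles, and the profile layout and function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def classify(features, profiles):
    """Return (label, score) of the reference profile closest to features."""
    scored = ((label, cosine_similarity(features, ref)) for label, ref in profiles.items())
    return max(scored, key=lambda item: item[1])


# Toy profiles for illustration; real ones would be MFCC-derived vectors.
profiles = {
    "a": [1.0, 0.2, -0.5, 0.1],
    "f": [-0.3, 0.9, 0.4, -0.2],
    "m": [0.1, -0.6, 0.8, 0.5],
}
label, score = classify([0.9, 0.1, -0.4, 0.2], profiles)
print(label, round(score, 3))
```

Because the model is just one stored vector per sound, the shipped asset stays small and the same logic ports directly to Kotlin without an inference runtime.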

Kotlin production (com.readingpractice.audio)

Hear it yourself — interactive QA dashboard

The classifier is only as good as the audio it's trained against. This is the ear-check tool I use to verify every one of the 66 sounds shipping in the L2R Android app — phoneme-by-phoneme audio playback, side-by-side comparison against older versions, approve/reject UI driven from a phone.

Audio QA dashboard: filter by manner class, compare new neural audio vs currently-shipped, approve/reject UI.
Spectral feature walkthrough: interactive vowel-vs-fricative comparison — waveform, spectrogram, top-8 features from the trained LightGBM ranked by real importance values. Built to explain to a non-ML stakeholder why the classifier behaves the way it does.
Outcome:
97.0% accuracy on the 66-phoneme test set (only 2 phonetically-identical confusions).
100% accuracy on Diane's voice (development baseline).
15.4KB reference-profile asset shipping in production L2R Android app (Play Store release build passed).
Zero cloud calls — COPPA-compliant by architecture.