Phoneme Classifier — On-Device Speech Recognition for 4-Year-Olds

66 phonemes · 97% accuracy · 15.4KB shipped asset · zero cloud calls

Most "ML in production" stories scale up — bigger models, more data, more compute. This one scales down: how small can the model be while still working for the actual user (a kid whose voice doesn't match any adult speech corpus)?

The two-track approach

Two parallel classifier tracks were developed and evaluated. Track A is the production-shipping baseline; Track B is the research direction for harder phoneme distinctions.

Track A — Classical (production)

MFCC features (13 coefficients + delta + delta-delta)
Spectral centroid, ZCR, HF ratio for fricative/plosive distinction
Cosine match against 66 reference profiles
Manner classifier: plosive / fricative / nasal / vowel
VOT detector for voiced/unvoiced plosive split
Reference profiles: 15.4KB compact Android asset
Augmentation impact: +4.1% fricative sibilance, +2.9% plosive place accuracy
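
The cosine match at the core of Track A can be sketched in a few lines. This is a minimal illustration, not the shipped code: the function name `cosine_match`, the 39-dimensional vectors (13 MFCCs plus delta and delta-delta, assumed here to be averaged over frames), and the random placeholder profiles are all assumptions made for the example.

```python
import numpy as np

def cosine_match(features, profiles):
    """Return (label, score) of the reference profile most similar to `features`."""
    labels = list(profiles)
    # Stack the reference profiles into a (num_phonemes, dim) matrix.
    matrix = np.stack([profiles[label] for label in labels]).astype(float)
    vec = np.asarray(features, dtype=float)
    eps = 1e-12  # guards against division by zero on all-zero vectors
    # Cosine similarity is the dot product of L2-normalised vectors.
    matrix_unit = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + eps)
    vec_unit = vec / (np.linalg.norm(vec) + eps)
    scores = matrix_unit @ vec_unit
    best = int(np.argmax(scores))
    return labels[best], float(scores[best])

# Placeholder profiles: random vectors standing in for real reference data.
rng = np.random.default_rng(0)
profiles = {name: rng.normal(size=39) for name in ("s", "sh", "m")}

# A query that is a slightly perturbed copy of the "sh" profile.
query = profiles["sh"] + 0.05 * rng.normal(size=39)
label, score = cosine_match(query, profiles)
print(label, round(score, 3))
```

Because the comparison only needs one small vector per phoneme, a table of 66 such profiles is consistent with an asset measured in kilobytes rather than megabytes.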

Track B — Neural (research)

WavLM fine-tune for child speech
ONNX export for on-device inference (Sherpa-ONNX runtime)
Targets the two phonetically identical confusion pairs Track A can't resolve (ee/ea and tch/ch)
Larger model (a few MB); production-readiness gated on memory budget
Status: separate repo, classifier complete, integration TBD

Confusion matrix — 64/66 correct (97.0%)

True / Predicted	ee	ea	tch	ch	(other 62)	Notes
ee	✓	→	—	—	—	"ee" sometimes classified as "ea"
ea	→	✓	—	—	—	"ea" sometimes classified as "ee"
tch	—	—	✓	→	—	"tch" sometimes classified as "ch"
ch	—	—	→	✓	—	"ch" sometimes classified as "tch"
(other 62)	—	—	—	—	✓✓✓	Perfect diagonal — 62 / 62 correct
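
The headline figure follows directly from the diagonal of the matrix. The sketch below shows the arithmetic on a 66-class confusion matrix; where the two off-diagonal entries sit is a hypothetical placement chosen only so the totals match the heading, and `accuracy_from_confusion` is an illustrative name.

```python
import numpy as np

def accuracy_from_confusion(matrix):
    """Overall accuracy = correctly classified items (the diagonal) / all items."""
    matrix = np.asarray(matrix)
    return np.trace(matrix) / matrix.sum()

n_phonemes = 66
confusion = np.eye(n_phonemes, dtype=int)  # start from a perfect diagonal

# Hypothetical placement: move two items off the diagonal to mimic two misses.
confusion[0, 0], confusion[0, 1] = 0, 1
confusion[2, 2], confusion[2, 3] = 0, 1

acc = accuracy_from_confusion(confusion)
print(f"{int(np.trace(confusion))}/{int(confusion.sum())} = {acc:.1%}")  # 64/66 = 97.0%
```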

Why this works on a 4-year-old's voice

Adult speech corpora (LibriSpeech, Common Voice) don't represent child voices. Children have higher fundamental frequency, shorter vocal tracts, and different formant ratios. Models trained only on adult data systematically fail on kids.

Adaptation	What it does	Impact
VTLN (Vocal Tract Length Normalization)	Frequency-warps adult-trained features to match child vocal tract.	Factor 1.104 derived from child F0 mean of 269 Hz. Brings child speech into adult feature space.
Sander 1972 substitution table	Age-gated phoneme substitution — accepts `/w/` for `/r/` at age 3, but not at age 6.	Avoids penalizing developmentally-typical pronunciations.
4-gate noise rejector	VAD + SNR + spectral flatness + duration gate before classification.	Rejects bedroom noise, sibling speech, and breath noise without falsely rejecting quiet voices.
Recording-on-device	Audio never leaves the phone.	COPPA-compliant by architecture: there's no parental consent dance for cloud transmission because there's no transmission.
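
To make the VTLN row concrete, here is a minimal sketch of a linear frequency warp applied to a magnitude spectrum with the factor 1.104 from the table. It rests on several assumptions: that the warp is linear (production VTLN is often piecewise-linear and applied inside the mel filterbank), that it maps child frequencies down toward the adult range, and the name `vtln_warp`. How the factor itself is derived from the 269 Hz mean F0 is not shown.

```python
import numpy as np

def vtln_warp(magnitudes, alpha):
    """Linearly warp a magnitude spectrum: output bin k reads the input at alpha * k.

    With alpha > 1, spectral peaks move to lower bins, which is the direction
    assumed here for pulling child speech toward adult-shaped features.
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    bins = np.arange(len(magnitudes))
    # Positions past the top bin have no source data, so they are filled with 0.
    return np.interp(bins * alpha, bins, magnitudes, right=0.0)

ALPHA = 1.104  # warp factor quoted in the table above

# Toy spectrum with a single peak at bin 138 (an arbitrary illustrative position).
spectrum = np.zeros(257)
spectrum[138] = 1.0

warped = vtln_warp(spectrum, ALPHA)
peak = int(np.argmax(warped))
print(peak)  # 125, since 125 * 1.104 = 138
```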

Stack — Python research → Kotlin production

Python research

phoneme_classifier.py — MFCC + cosine, 100% accuracy on Diane's recordings
manner_detector.py — manner class detection
vot_detector.py — voice onset time
spectral_matcher.py — fricatives + vowels
confusion_matrix.py — full matrix + HTML report
generate_reference_profiles.py — builds the 15.4KB asset
child_speech_test.py — VTLN validation on real child data
run_full_qa.sh — one-command full QA pipeline
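
The spectral features named in this stack (ZCR, spectral centroid, HF ratio) are simple to compute with NumPy alone. The sketch below is an illustration rather than the contents of `spectral_matcher.py`: the function names and the 4 kHz cutoff for the HF ratio are assumptions, and the two sine tones are only stand-ins for a vowel-like and a sibilant-like sound.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency in Hz."""
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = mags.sum()
    return float((freqs * mags).sum() / total) if total > 0 else 0.0

def hf_ratio(signal, sample_rate, cutoff_hz=4000.0):
    """Share of spectral power at or above cutoff_hz (the cutoff is an assumed value)."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = power.sum()
    return float(power[freqs >= cutoff_hz].sum() / total) if total > 0 else 0.0

sr = 16000
t = np.arange(1024) / sr
low_tone = np.sin(2 * np.pi * 250 * t)    # vowel-like stand-in
high_tone = np.sin(2 * np.pi * 6000 * t)  # sibilant-like stand-in

for name, sig in (("low", low_tone), ("high", high_tone)):
    print(name, zero_crossing_rate(sig), spectral_centroid(sig, sr), hf_ratio(sig, sr))
```

All three values rise sharply for the high tone, which is why features of this kind help separate fricatives from vowels and nasals.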

Kotlin production (`com.readingpractice.audio`)

PhonemeFeatureExtractor — MFCC, spectral, ZCR, HF ratio
PhonemeClassifier — cosine match against profiles, age-adjusted tiers
AudioCaptureManager — mic capture with VAD
PhonemeFeedbackEngine — capture → features → classify → feedback
VTLNCalibrator — child speech normalization
PhonemeProgressTracker — mastery levels per phoneme
DevelopmentalSubstitutions — Sander 1972 table
NoiseRejector — 4-gate noise rejection
120 unit tests across 9 test files
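
Two of the components above, `NoiseRejector` and `DevelopmentalSubstitutions`, reduce to small decision rules. The Python sketch below illustrates the shape of that logic only; it is not a port of the Kotlin classes. Every threshold, the energy check standing in for a real VAD, the function names, and the cutoff age in the substitution table are hypothetical placeholders, not values from the app or from Sander (1972).

```python
import numpy as np

def spectral_flatness(signal):
    """Geometric mean / arithmetic mean of the power spectrum (near 0 = tonal)."""
    power = np.abs(np.fft.rfft(signal)) ** 2 + 1e-12  # small floor avoids log(0)
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def check_gates(signal, sample_rate, noise_rms, min_rms=0.01, min_snr_db=10.0,
                max_flatness=0.4, min_dur_s=0.05, max_dur_s=2.0):
    """Run four gates; all must pass before the clip goes to the classifier."""
    signal = np.asarray(signal, dtype=float)
    rms = float(np.sqrt(np.mean(signal ** 2)))
    snr_db = 20.0 * np.log10((rms + 1e-12) / (noise_rms + 1e-12))
    duration_s = len(signal) / sample_rate
    gates = {
        "vad": rms >= min_rms,  # crude energy check standing in for a VAD
        "snr": bool(snr_db >= min_snr_db),
        "flatness": spectral_flatness(signal) <= max_flatness,
        "duration": min_dur_s <= duration_s <= max_dur_s,
    }
    return all(gates.values()), gates

# Hypothetical table: target phoneme -> {substitute: oldest age at which it is accepted}.
SUBSTITUTIONS = {"r": {"w": 5}}

def is_accepted(target, heard, age_years):
    """Accept an exact match, or a substitution still typical at this age."""
    if heard == target:
        return True
    max_age = SUBSTITUTIONS.get(target, {}).get(heard)
    return max_age is not None and age_years <= max_age

sr = 16000
t = np.arange(3200) / sr  # 0.2 s clips
tone = 0.3 * np.sin(2 * np.pi * 300 * t)                      # voiced-like
hiss = np.random.default_rng(0).normal(scale=0.3, size=3200)  # broadband noise

tone_ok, _ = check_gates(tone, sr, noise_rms=0.01)
hiss_ok, hiss_gates = check_gates(hiss, sr, noise_rms=0.01)
print(tone_ok, hiss_ok, hiss_gates)
print(is_accepted("r", "w", 3), is_accepted("r", "w", 6))
```

Returning the per-gate results alongside the overall verdict makes it easy to unit-test each gate separately and to log which one rejected a clip.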

Hear it yourself — interactive QA dashboard

The classifier is only as good as the audio it's trained against. This is the ear-check tool I use to verify every one of the 66 sounds shipping in the L2R (Learn to Read) Android app: phoneme-by-phoneme audio playback, side-by-side comparison against older versions, and an approve/reject UI driven from a phone.

Production source: projects/learn_to_read/scripts/ (Python) + app/src/main/java/com/readingpractice/audio/ (Kotlin)
Track A (classical): projects/track_a_classical/
Track B (neural): projects/neural_phonics/
Phonics classifier (research): projects/phonics_classifier/