From Cosine Similarity to 97.4% — Building the Phoneme Classifier
§1 See it running
The L2R Android app prompts a sound, listens for the kid's response, runs both classifiers on-device, and decides feedback in under 50 ms.
§2 The architecture that shipped
Hierarchical routing, not a single ensemble. Track B picks the manner; Track A's per-manner classifiers refine to a specific phoneme. Each within-manner classifier is a single model — the family was picked per-task by cross-validation.
How it actually runs at inference time
Track B fires once on the raw audio — picks one of six manner classes. Only that manner branch fires. No softmax across branches, no cross-manner ensembling. For plosive, two Track A classifiers run in parallel (voicing × place → cross-product yields the phoneme). For fricative / nasal / approximant / vowel, one classifier fires. Affricate currently routes to /tʃ/ by default — voicing classifier for /tʃ/ vs /dʒ/ is on the Track A roadmap.
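The routing above can be sketched in a few lines. This is a minimal illustration, not the shipped Kotlin code: `route`, the `heads` dictionary keys, and the `PLOSIVES` lookup table are hypothetical names, and the classifiers are stubs. The two plosive heads run one after the other here for simplicity, whereas the app runs them in parallel.

```python
from typing import Callable, Dict, Sequence

Classifier = Callable[[Sequence[float]], str]

# Hypothetical lookup table: (voicing, place) -> plosive phoneme.
PLOSIVES = {
    ("voiced", "bilabial"): "b", ("voiceless", "bilabial"): "p",
    ("voiced", "alveolar"): "d", ("voiceless", "alveolar"): "t",
    ("voiced", "velar"): "g", ("voiceless", "velar"): "k",
}


def route(audio: Sequence[float], manner_clf: Classifier,
          heads: Dict[str, Classifier]) -> str:
    """Pick the manner once, then fire only that manner's branch."""
    manner = manner_clf(audio)  # Track B runs once on the raw audio
    if manner == "plosive":
        # Two Track A heads; their cross-product names the phoneme.
        voicing = heads["plosive_voicing"](audio)
        place = heads["plosive_place"](audio)
        return PLOSIVES[(voicing, place)]
    if manner == "affricate":
        return "tʃ"  # current default; no voicing head for affricates yet
    # Fricative, nasal, approximant, vowel: a single head per manner.
    return heads[manner](audio)


# Stub classifiers stand in for the real models.
stub_heads = {
    "plosive_voicing": lambda a: "voiced",
    "plosive_place": lambda a: "alveolar",
    "fricative": lambda a: "s",
}
print(route([0.0], lambda a: "plosive", stub_heads))  # d
```

Because only one branch fires, no score from another manner's head ever enters the decision, which is what "no softmax across branches" means in practice.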
Why a different model family per Track A task
Each within-manner task has a different feature-importance shape, so a different model wins. Plosive voicing is dominated by one feature (f_vot_ms) with a clear threshold — Random Forest handles it cleanly. Plosive place is monotonic in burst peak frequency — Logistic Regression fits. Sibilance has nonlinear feature interactions across the 4–8 kHz band — ExtraTrees (random splits) captures that. Nasal place is small-data, narrow-feature — Gradient Boosting's bias correction wins. Each family was picked per-task by cross-validation, not by global selection.
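Per-task selection by cross-validation reduces to a small loop: score every candidate family with k-fold CV on that task's data and keep the winner. The sketch below is stdlib-only and uses two toy models (`MajorityClass`, `ThresholdStump`) as stand-ins for the scikit-learn families; every name and the VOT-like data are illustrative assumptions, not the project's code.

```python
import random
from statistics import mean


class MajorityClass:
    """Baseline: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = max(sorted(set(y)), key=y.count)

    def predict(self, x):
        return self.label


class ThresholdStump:
    """Two-class toy model: splits feature 0 midway between the class means."""
    def fit(self, X, y):
        a, b = sorted(set(y))
        mean_a = mean(x[0] for x, t in zip(X, y) if t == a)
        mean_b = mean(x[0] for x, t in zip(X, y) if t == b)
        self.cut = (mean_a + mean_b) / 2
        self.low, self.high = (a, b) if mean_a < mean_b else (b, a)

    def predict(self, x):
        return self.low if x[0] < self.cut else self.high


def cv_accuracy(make_model, X, y, k=5, seed=0):
    """Mean held-out accuracy over k shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = make_model()
        model.fit([X[i] for i in train], [y[i] for i in train])
        scores.append(mean(model.predict(X[i]) == y[i] for i in fold))
    return mean(scores)


def select_family(candidates, X, y):
    """Return the best-scoring candidate name and all CV scores."""
    scores = {name: cv_accuracy(make, X, y) for name, make in candidates.items()}
    return max(scores, key=scores.get), scores


# Made-up VOT-like values in ms: short lag = voiced, long lag = voiceless.
X = [[5], [10], [15], [12], [8], [55], [60], [70], [65], [58]]
y = ["voiced"] * 5 + ["voiceless"] * 5
best, scores = select_family(
    {"majority": MajorityClass, "stump": ThresholdStump}, X, y)
print(best)  # stump
```

Running the same loop once per within-manner task, each with its own feature matrix, is what lets a different family win each task.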
Why hierarchical instead of flat 26-way
Different manner classes have fundamentally different acoustic signatures (Stevens 1998 §7.2). A plosive burst is a 5–20 ms transient; a fricative is 50–200 ms of sustained noise. A flat 26-way classifier would have to use one feature recipe for both. Hierarchical lets each branch use its best features and its best model family.
§3 Stack
Python (research + training)
- WavLM base+ via transformers, fine-tuned in PyTorch (2-phase: head only → full unfreeze)
- ONNX export: FP32 / FP16 / INT8 variants benchmarked; FP16 shipped
- librosa feature extraction: 4 extractor scripts producing ~300 features per sample
- scikit-learn for Track A: Random Forest, Logistic Regression, ExtraTrees, Gradient Boosting, SVC (picked per task by CV)
- LightGBM as an interpretability cross-check on feature importance
- Domain-informed augmentation: weak-burst plosives, reduced-frication fricatives, vowel-like glides, denasalization
Kotlin (Android production)
- NeuralPhonemeClassifier: ONNX Runtime wrapper for Track B
- PhonemeClassifier: Track A within-manner heads (cosine + tier system)
- DevelopmentalFeedbackPolicy: 3-axis policy (Sander mastery age + confidence threshold + dev-substitution table)
- VTLNCalibrator: child speech vocal-tract normalization (factor 1.104 from child F0 mean)
- NoiseRejector: 4-gate audio quality filter (VAD + SNR + spectral flatness + duration)
/w/ for /r/ at age 3). The policy combines per-phoneme mastery age (/m/→2yo, /r/→7yo) + model confidence ≥ 0.7 + a known-substitution lookup table. Simulation result: 100% accuracy on real feedback, zero false negatives on developmentally-typical pronunciations. A 3-year-old is never marked wrong for a normal sound.
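One way the three axes could combine is sketched below. Only the two mastery ages, the 0.7 threshold, and the /w/-for-/r/ substitution come from the text; the function name, the return labels, and the order of the checks are assumptions for illustration, not the shipped `DevelopmentalFeedbackPolicy`.

```python
# Mastery ages stated in the text; a full table would follow Sander (1972).
MASTERY_AGE = {"m": 2, "r": 7}

# (target, produced) pairs treated as developmentally typical.
# Only /w/ for /r/ is named in the text.
DEV_SUBSTITUTIONS = {("r", "w")}

CONFIDENCE_THRESHOLD = 0.7


def decide_feedback(target, predicted, confidence, age_years):
    """Return a hypothetical feedback label for one attempt."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "retry"  # model is unsure, so do not judge the child
    if predicted == target:
        return "correct"
    # Phonemes missing from the table are treated as already mastered.
    below_mastery = age_years < MASTERY_AGE.get(target, 0)
    if below_mastery and (target, predicted) in DEV_SUBSTITUTIONS:
        return "encourage"  # typical substitution for this age, not an error
    return "incorrect"


print(decide_feedback("r", "w", 0.9, age_years=3))  # encourage
print(decide_feedback("r", "w", 0.9, age_years=8))  # incorrect
```

Checking confidence first means a low-confidence prediction can never produce negative feedback, whatever the child's age.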
§4 The features that mattered most
Each Track A within-manner task picks its own features. Feature importance and CV accuracy decide both the top feature and the model family. The winning feature for each task is grounded in published phonetics literature.
| Within-manner task | Top feature | Model | Accuracy (all data) | Accuracy (child only) |
|---|---|---|---|---|
| Plosive voicing (b/p, d/t, g/k) | f_vot_ms (voice onset time) | RF | 91.1% | 94.3% |
| Plosive place (bilabial · alveolar · velar) | f_burst_peak_hz (burst spectral peak) | LR | 71.4% | 68.6% |
| Fricative sibilance (s/z vs f/v) | f_sibilance (4–8 kHz energy ratio) | ET | 83.7% | 70.8% |
| Nasal place (m · n · ŋ) | d_nasal_band (narrowband 250–350 Hz delta) | GB | 100% | 71.4% |
| Approximant type (l · r · w · y) | s_f3_hz (third formant frequency) | LR | 95.8% | 85.7% |
| Vowel letter | general features (F1 / F2 / MFCCs) | ET | 66.7% | 54.5% |
Each top feature is the one that phonetics literature predicts should distinguish those phonemes. f_vot_ms for voicing (Lisker & Abramson 1964), f_burst_peak_hz for plosive place (Stevens 1998), f_sibilance for sibilant fricatives (Jongman et al. 2000), narrowband nasal formant for nasal place (Stevens 1998), F3 for /r/-vs-/l/ (Espy-Wilson 1992). The classical model learned the phonetics curriculum. The full feature catalog and contrastive-pair visualizations live on the feature engineering page.
Vowel is the bottleneck: the vowel classifier currently lacks F1/F2 formants in the 84-feature overlap available at training time. Full feature re-extraction is queued (Track A roadmap).
§5 Evolution of accuracy
Six iterations. One domain insight took Track B from 89.5% → 97.4%.
Five iterations on Track B. The 7.9-point gain from 89.5% to 97.4% came not from more data or a bigger model but from one domain insight about child plosive burst energy.
Originally shipping in L2R V1: 5 hand-weighted features — MFCC 70% + spectral centroid 10% + ZCR 10% + duration 5% + RMS 5%. Cosine match against 26 reference profiles, one feature set for every manner class.
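A sketch of what such a hand-weighted match could look like follows. The weights are the ones listed above; everything else is an assumption, in particular how V1 turned five features into one score. Here the MFCC vector is compared by cosine similarity, each scalar by a relative-difference similarity, and the results are blended with the stated weights. The profile values are made up.

```python
import math

# Hand-set weights from the V1 description.
WEIGHTS = {"mfcc": 0.70, "centroid": 0.10, "zcr": 0.10,
           "duration": 0.05, "rms": 0.05}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scalar_similarity(a, b):
    """1.0 when equal, falling toward 0.0 as non-negative values diverge."""
    top = max(abs(a), abs(b))
    return 1.0 - abs(a - b) / top if top else 1.0


def score(sample, profile):
    """Weighted blend of per-feature similarities (assumed combination rule)."""
    total = WEIGHTS["mfcc"] * cosine(sample["mfcc"], profile["mfcc"])
    for key in ("centroid", "zcr", "duration", "rms"):
        total += WEIGHTS[key] * scalar_similarity(sample[key], profile[key])
    return total


def best_match(sample, profiles):
    """Name of the reference profile with the highest score."""
    return max(profiles, key=lambda name: score(sample, profiles[name]))


# Two made-up reference profiles; the real system had 26.
profiles = {
    "s": {"mfcc": [1.0, 0.2, -0.5], "centroid": 6000.0, "zcr": 0.45,
          "duration": 0.15, "rms": 0.05},
    "m": {"mfcc": [-0.3, 0.9, 0.4], "centroid": 800.0, "zcr": 0.05,
          "duration": 0.12, "rms": 0.10},
}
sample = {"mfcc": [0.9, 0.1, -0.4], "centroid": 5500.0, "zcr": 0.40,
          "duration": 0.14, "rms": 0.06}
print(best_match(sample, profiles))  # s
```

Whatever the exact combination rule was, the same weights applied to every manner class, which is the limitation the hierarchical design removes.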
108 verified samples, 284 features (the full classical catalog). Two evaluation regimes: 5-fold CV and speaker-stratified (train = adult + 4yo, test = 2yo child).
The 23-point drop between CV and speaker-stratified is the entire problem. Classical features overfit to speaker characteristics; they don't generalize across age.
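The speaker-stratified regime differs from k-fold CV in one respect: whole speaker groups are held out, so no test speaker is ever seen in training. A minimal sketch, assuming each sample record carries a `group` field (a hypothetical schema, as are the file paths):

```python
def speaker_stratified_split(samples, test_groups=frozenset({"child_2yo"})):
    """Hold out entire speaker groups instead of random samples."""
    train = [s for s in samples if s["group"] not in test_groups]
    test = [s for s in samples if s["group"] in test_groups]
    return train, test


# Hypothetical records; the split in the text is adult + 4yo vs 2yo.
samples = [
    {"path": "adult/phonics_d.wav", "group": "adult"},
    {"path": "child4/phonics_d.wav", "group": "child_4yo"},
    {"path": "child2/phonics_d.wav", "group": "child_2yo"},
    {"path": "child2/phonics_m.wav", "group": "child_2yo"},
]
train, test = speaker_stratified_split(samples)
print(len(train), len(test))  # 2 2
```

Random k-fold puts recordings from the same speaker on both sides of the split, so it rewards features that memorize the speaker; the held-out-group split removes that shortcut.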
WavLM base+ as a frozen feature extractor, classical head (RF / LR / SVC) on top. Same speaker-stratified split. +14% over the MFCC baseline.
WavLM was chosen over wav2vec2 and HuBERT because Block Medin et al. (Interspeech 2024) found WavLM base+ to be the best-performing SSL model in their comparison for children's phoneme recognition.
Two-phase fine-tune: (1) classification head only, (2) unfreeze top WavLM layers and train full model.
Phase 1 GO/NO-GO gate: 71.4%. Phase 2 final: 89.5%.
Inspecting errors: 4 of 38 wrong, every single one a 2-year-old plosive. Adult and 4yo plosives were fine; 2yo plosives were systematically misclassified.
The insight. I went back to the audio and measured the misclassifications. 2-year-olds produce plosives with half the burst energy of 4-year-olds. Their consonant onsets are softer, often vowelized, sometimes prevoiced.
phonics_d.wav from a 2-year-old (left) and a 4-year-old (right), same in-home phone-mic sessions used for training. The first 80 ms burst window is highlighted on each. RMS energy in that window: 0.0895 at age 2 vs 0.1465 at age 4 — the 2-year-old produces the burst at ≈ 61% of the older child's energy. Across the corpus the average ratio is closer to ½; /d/ visualizes the pattern cleanly. Generic noise/pitch augmentation can't recreate this — the structure is qualitatively different, not just quieter.
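The window measurement is plain RMS over the first 80 ms. The sketch below assumes the clip is already trimmed so the burst starts at sample 0; `window_rms` is an illustrative helper, not a function from the project. The last lines reproduce the ratio from the two RMS values quoted above.

```python
import math


def window_rms(samples, sample_rate, window_ms=80.0):
    """RMS energy of the first window_ms of an onset-aligned clip."""
    n = int(sample_rate * window_ms / 1000.0)
    window = samples[:n]
    if not window:
        return 0.0
    return math.sqrt(sum(x * x for x in window) / len(window))


# Sanity check on a synthetic constant-amplitude signal.
print(window_rms([0.5] * 1600, sample_rate=16000))  # 0.5

# Ratio of the two burst-window RMS values reported for /d/.
ratio = 0.0895 / 0.1465
print(f"{ratio:.3f}")  # 0.611
```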
First attempt, plosive-only augmentation: 92.1%. Synthesized weak-burst, vowelized, and prevoiced plosive variants. That gained 2.6 points (89.5% → 92.1%), but it skewed the training distribution toward plosives and introduced regressions in fricatives and nasals.
Final — balanced augmentation across all classes: 97.4%. Same domain-informed augmentation philosophy applied to every class proportionally:
- Plosive: weak-burst, vowelized, prevoiced
- Fricative: reduced-frication noise (child /s/ is often weaker)
- Approximant: more vowel-like glides (child /r/→/w/)
- Nasal: denasalization variants
All 4 child-plosive errors fixed. Zero regressions in other classes.
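Two pieces of this step lend themselves to a short sketch: a weak-burst transform and a plan that keeps the class mix unchanged. Both are simplified assumptions. The real variants also change burst structure (vowelization, prevoicing), which a gain change alone does not model, and "proportionally" is read here as the same number of augmented copies per original sample in every class.

```python
def weaken_burst(samples, sample_rate, gain=0.5, window_ms=80.0):
    """Attenuate only the burst window; the rest of the clip is untouched.

    A hard gain step is used for brevity; a real transform would ramp the
    gain to avoid a discontinuity at the window edge.
    """
    n = min(len(samples), int(sample_rate * window_ms / 1000.0))
    return [x * gain for x in samples[:n]] + list(samples[n:])


def balanced_plan(class_counts, copies_per_sample=2):
    """Same multiplier for every class, so class proportions are preserved."""
    return {manner: count * copies_per_sample
            for manner, count in class_counts.items()}


clip = [1.0] * 2000  # 125 ms of made-up audio at 16 kHz
weak = weaken_burst(clip, sample_rate=16000)
print(weak[0], weak[-1])  # 0.5 1.0

# Made-up class counts for illustration.
print(balanced_plan({"plosive": 10, "nasal": 4}))  # {'plosive': 20, 'nasal': 8}
```

Augmenting only plosives would correspond to a nonzero multiplier for one class and zero for the rest, which is the distribution skew the first attempt ran into.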
What I tested and discarded
- Knowledge distillation: trained a tiny student model. Failed at 94 samples — not enough signal for the smaller capacity. Kept full WavLM via Play Asset Delivery.
- Volume normalization: RMS / peak normalization changed 0 predictions across the test set. Removed from pipeline.
- WavLM layer probing: probed all 12 transformer layers. Layer 10 wins for pretrained probing, layer 7 wins after fine-tune. Informed the fine-tuning unfreeze schedule.
Source: neural_deep_analysis.py. I tested my assumptions — two failed, one informed the architecture.
§6 References
Stevens, K.N. (1998) · Acoustic Phonetics, MIT Press — manner class signatures, plosive place, nasal anti-formants
Sander, E.K. (1972) · "When are speech sounds learned?" JSHD 37(1) — phoneme acquisition timeline
Jongman, A., Wayland, R. & Wong, S. (2000) · "Acoustic characteristics of English fricatives," JASA 108(3) — sibilance band
Espy-Wilson, C.Y. (1992) · "Acoustic measures for semivowels /w j r l/," JASA 92(2) — F3 for /r/ vs /l/
Lee, S., Potamianos, A. & Narayanan, S. (1999) · "Acoustics of children's speech," JASA 105(3) — child formant frequencies, VTLN basis
Howell, P. & Rosen, S. (1983) · "Production and perception of rise time in the affricate-fricative distinction," JASA 73(3)
Smit, A.B. et al. (1990) · "The Iowa Articulation Norms Project" — acquisition ages
Block Medin, L. et al. (2024) · "Self-supervised models for phoneme recognition: applications in children's speech for reading learning," Interspeech 2024 (WavLM base+ selected as best SSL model)