PROBLEM: The neural across-manner classifier hits 97.4% but is opaque. How do you explain to a stakeholder why a model thinks the letter "f" is a fricative?
WHY IT MATTERS: A WavLM fine-tune is a 96M-parameter black box. To trust the model in production with kids' voices, I cross-check with a classical LightGBM that exposes feature importance — and verify the top features match what phonetics theory predicts.
STACK: Python (librosa for feature extraction, LightGBM for the cross-check classifier), Plotly.js (this page) for interactive visualization. Feature values are computed from phonics_a.wav and phonics_f.wav using the exact extractor that produced the trained model.
Spectral Features — /a/ vs /f/
Vowel vs fricative, audio → features → classification — grounded in the trained LGBM importance ranking
"The neural classifier (WavLM fine-tune) hits 97.4% on across-manner classification, but it's a transformer — no interpretable feature importance. So I trained a classical LightGBM on the same data with hand-engineered features. Its top features match phonetics theory exactly: spectral centroid change, flatness change, MFCCs at specific bands, F3 formant. That's how I explain to a stakeholder why the model behaves as it does."
① Listen to the two phonemes
Hear before you analyze. Vowel /æ/ is sustained and tonal. Fricative /f/ is short and noisy.
/æ/ vowel — the letter "a"
/f/ fricative — the letter "f"
② Waveform — time domain
Vowel /æ/: periodic, regular oscillation (vocal-fold vibration). Fricative /f/: aperiodic, noise-like (turbulent airflow through the lips/teeth).
③ Mel spectrogram — time × frequency × energy
Hot colors = more energy. Vowel: stacked horizontal bands (formants F1 ~700Hz, F2 ~1700Hz). Fricative: diffuse high-frequency wash (4–8 kHz energy from turbulence). This is the input the WavLM model sees.
④ Top 8 features by LightGBM importance
Real importances from phoneme_lgbm_model.json. Click any feature row to see the explanation; the bars show the actual computed value for /a/ vs /f/, normalized so you can see relative magnitude.
💡 Hover any feature to highlight the difference. Click to expand the explanation.
How the trained classifier connects to phonetics theory
The four highest-importance features for the LGBM phoneme classifier are: d_centroid, d_flatness, d_rms, f_m11_mean. The d_ prefix means "delta" — the change in that feature between the onset window (first 30ms) and the steady window (middle 20–80%).
Phonetics theory (per neural_phonics/scripts/neural_within_manner.py §7.2) predicts:
Fricatives are distinguished by spectral centroid, spectral peak, and sibilance (4–8 kHz turbulence)
Vowels are distinguished by F1 (height) and F2 (front-back) formants — low-frequency harmonic structure
This is exactly what the LGBM ranks at the top: spectral centroid change (#1) captures fricative onset brightness, flatness change (#2) captures noise vs. tone, and the F3 formant (rank 6) is direct linguistic theory. The trained classifier learned the phonetics curriculum.
Real feature values computed from: phoneme_qa/phonics_a.wav, phoneme_qa/phonics_f.wav
Importance source: learn_to_read/scripts/phoneme_lgbm_model.json (LightGBM model trained on 108 phoneme samples)
Domain theory source: neural_phonics/scripts/neural_within_manner.py §7.2
Cross-validated against neural WavLM-manner-finetuned at 97.4% across-manner accuracy