BabAR: from phoneme recognition to developmental measures of young children's speech production

Marvin Lavechin, Elika Bergelson, Roger Levy

TL;DR

BabAR, a cross-linguistic phoneme recognition system for child speech, is trained on TinyVox. Pretraining on multilingual child-centered daylong recordings substantially outperforms alternatives, and providing 20 seconds of surrounding audio context during fine-tuning further improves performance.

Abstract

Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. We use TinyVox to train BabAR, a cross-linguistic phoneme recognition system for child speech. We find that pretraining the system on multilingual child-centered daylong recordings substantially outperforms alternatives, and that providing 20 seconds of surrounding audio context during fine-tuning further improves performance. Error analyses show that substitutions predominantly fall within the same broad phonetic categories, suggesting suitability for coarse-grained developmental analyses. We validate BabAR by showing that its automatic measures of speech maturity align with developmental estimates from the literature.
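
To make the notion of "surrounding audio context" concrete, here is a minimal sketch of how a context window around a target utterance could be computed from its annotated boundaries. It is not the authors' implementation: the function name `context_window`, the even split of the context budget before and after the utterance, the clipping at the recording edges, and the 16 kHz sample rate are all assumptions made for illustration.

```python
def context_window(onset_s, offset_s, context_s, recording_s, sample_rate=16_000):
    """Return (start, end) sample indices covering an utterance plus context.

    Assumptions (not taken from the paper): the context budget `context_s`
    is split evenly before and after the utterance, the window is clipped
    at the recording boundaries, and audio is sampled at 16 kHz.
    """
    if offset_s <= onset_s:
        raise ValueError("offset_s must be greater than onset_s")
    if context_s < 0:
        raise ValueError("context_s must be non-negative")

    half = context_s / 2.0
    start_s = max(0.0, onset_s - half)          # clip at recording start
    end_s = min(recording_s, offset_s + half)   # clip at recording end
    return int(round(start_s * sample_rate)), int(round(end_s * sample_rate))


# Utterance from 100.0 s to 101.5 s in a one-hour recording.
no_context = context_window(100.0, 101.5, 0.0, 3600.0)
with_context = context_window(100.0, 101.5, 20.0, 3600.0)
```

With `context_s = 0` the window reduces to the utterance itself, which matches the no-context condition in which the model receives only the target child utterance.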

Paper Structure

This paper contains 22 sections, 2 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Age distribution (panel a) and language distribution (panel b) of phonetically transcribed utterances in TinyVox.
  • Figure 2: Validation phoneme error rate (%, lower is better) for different self-supervised models fine-tuned on TinyVox. Means and standard deviations are computed across 5 training seeds.
  • Figure 3: Validation phoneme error rate (%, lower is better) for BabAR (BabyHuBERT fine-tuned on TinyVox) as a function of context duration $c$. $c = 0$ corresponds to the model receiving only the target child speech utterances (using human-annotated boundaries). Error bars represent the standard deviation across 5 training seeds. (N.B.: truncated y-axis)
  • Figure 4: Substitution matrices for vowels (panel a) and consonants (panel b) indicating which substitution errors BabAR makes. The darker the cell, the higher the substitution rate. All numbers are computed on the test set of TinyVox. Substitutions were more likely within vowel/consonant categories (in each outlined square) than across them.
  • Figure 5: Proportion of utterances with consonant-vowel (CV) or vowel-consonant (VC) transitions as a function of age (in months). Gray lines show individual trajectories computed by BabAR for 44 American English-learning children from SEEDLingS, and the blue curve shows the corresponding average. The orange curve shows the average trajectory derived from manual annotation from a meta-analysis by Cychosz & Long (2025). Shaded areas indicate $95\%$ confidence intervals.
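
Two quantities in the captions above can be sketched in a few lines: the phoneme error rate (Figures 2 and 3) and the proportion of utterances containing a CV or VC transition (Figure 5). The sketch below uses the standard definition of an error rate as Levenshtein edits divided by reference length, pooled over the corpus; whether the paper pools or averages per utterance is not stated here, so that choice is an assumption. The `VOWELS` set is a small, hypothetical, single-character inventory for illustration only, and treating a "transition" as any adjacent consonant-vowel or vowel-consonant pair is likewise an assumed reading of the caption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                curr[j - 1] + 1,         # insertion of a hypothesis phoneme
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER in percent: total edits / total reference phonemes."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference contains no phonemes")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_edits / total_ref


# Hypothetical, incomplete vowel inventory (single-character symbols only).
VOWELS = frozenset("aeiouɑɛɪɔʊəæʌ")


def has_cv_or_vc(phonemes):
    """True if any two adjacent phonemes differ in vowel/consonant class."""
    is_vowel = [p in VOWELS for p in phonemes]
    return any(a != b for a, b in zip(is_vowel, is_vowel[1:]))


def cv_vc_proportion(utterances):
    """Share of utterances that contain at least one CV or VC transition."""
    if not utterances:
        raise ValueError("no utterances given")
    return sum(has_cv_or_vc(u) for u in utterances) / len(utterances)


refs = [["b", "a"], ["m", "a", "m", "a"]]
hyps = [["p", "a"], ["m", "a", "m"]]
per = phoneme_error_rate(refs, hyps)        # 2 edits over 6 reference phonemes
share = cv_vc_proportion([["a"], ["b", "a"], ["a", "a"], ["m"]])
```

In the usage lines, one substitution and one deletion over six reference phonemes give a PER of about 33.3%, and only one of the four toy utterances (`["b", "a"]`) contains a transition, giving a proportion of 0.25.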