BabAR: from phoneme recognition to developmental measures of young children's speech production

Marvin Lavechin; Elika Bergelson; Roger Levy

TL;DR

BabAR is a cross-linguistic phoneme recognition system for child speech, trained on TinyVox. Pretraining on multilingual child-centered daylong recordings substantially outperforms alternatives, and providing 20 seconds of surrounding audio context during fine-tuning further improves performance.

Abstract

Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. We use TinyVox to train BabAR, a cross-linguistic phoneme recognition system for child speech. We find that pretraining the system on multilingual child-centered daylong recordings substantially outperforms alternatives, and that providing 20 seconds of surrounding audio context during fine-tuning further improves performance. Error analyses show that substitutions predominantly fall within the same broad phonetic categories, suggesting suitability for coarse-grained developmental analyses. We validate BabAR by showing that its automatic measures of speech maturity align with developmental estimates from the literature.
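Phoneme recognition quality of this kind is conventionally scored with the phoneme error rate (PER): the minimum number of substitutions, deletions, and insertions needed to turn the predicted phoneme sequence into the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' scoring code. The function names are hypothetical, and pooling errors across utterances before dividing is an assumption about how the corpus-level percentage is aggregated.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference phoneme
                    curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER in percent (assumption: errors pooled over utterances)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference transcripts contain no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_errors / total_ref


# Toy example with made-up transcripts: one substitution and one deletion
# over six reference phonemes.
refs = [["b", "a"], ["m", "a", "m", "a"]]
hyps = [["p", "a"], ["m", "a", "m"]]
per = phoneme_error_rate(refs, hyps)
```

Because insertions count as errors, PER computed this way can exceed 100% when the hypothesis is much longer than the reference.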

Paper Structure (22 sections, 2 equations, 5 figures, 3 tables)

Introduction
Methods
Datasets
TinyVox (training, validation, and test sets)
The SEEDLingS corpus (held-out set)
Automatic phoneme recognition
Self-supervised models
Context-aware fine-tuning with extended audio
Baselines
Evaluation metric
Implementation details
Results
Which self-supervised model performs best?
How much context helps?
What types of errors does BabAR make?
...and 7 more sections

Figures (5)

Figure 1: Age distribution (panel a) and language distribution (panel b) of phonetically transcribed utterances in TinyVox.
Figure 2: Validation phoneme error rate (%, lower is better) for different self-supervised models fine-tuned on TinyVox. Means and standard deviations are computed across 5 training seeds.
Figure 3: Validation phoneme error rate (%, lower is better) for BabAR (BabyHuBERT fine-tuned on TinyVox) as a function of context duration $c$. $c = 0$ corresponds to the model receiving only the target child speech utterances (using human-annotated boundaries). Error bars represent the standard deviation across 5 training seeds. (N.B.: truncated y-axis)
Figure 4: Substitution matrices for vowels (panel a) and consonants (panel b) indicating which substitution errors BabAR makes. The darker the cell, the higher the substitution rate. All numbers are computed on the test set of TinyVox. Substitutions were more likely within vowel/consonant categories (in each outlined square) than across them.
Figure 5: Proportion of utterances with consonant-vowel (CV) or vowel-consonant (VC) transitions as a function of age (in months). Gray lines show individual trajectories computed by BabAR for 44 American English-learning children from SEEDLingS, and the blue curve shows the corresponding average. The orange curve shows the average trajectory derived from manual annotations in the meta-analysis by Cychosz & Long (2025). Shaded areas indicate $95\%$ confidence intervals.
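The measure plotted in Figure 5, the proportion of utterances containing a CV or VC transition, can be illustrated with a short sketch. This is one plausible reading of the caption, not the authors' implementation: the vowel set below is a hypothetical placeholder rather than BabAR's phoneme inventory, every symbol outside it is assumed to be a consonant, and the function names are invented for illustration.

```python
from typing import Collection, Sequence

# Hypothetical vowel inventory, for illustration only.
VOWELS = frozenset({"a", "e", "i", "o", "u"})


def has_cv_or_vc_transition(
    phonemes: Sequence[str], vowels: Collection[str] = VOWELS
) -> bool:
    """True if any two adjacent phonemes differ in vowel/consonant class."""
    is_vowel = [p in vowels for p in phonemes]
    # A CV or VC transition is an adjacent pair with different classes.
    return any(a != b for a, b in zip(is_vowel, is_vowel[1:]))


def transition_proportion(
    utterances: Sequence[Sequence[str]], vowels: Collection[str] = VOWELS
) -> float:
    """Fraction of utterances with at least one CV or VC transition."""
    if not utterances:
        raise ValueError("need at least one utterance")
    hits = sum(has_cv_or_vc_transition(u, vowels) for u in utterances)
    return hits / len(utterances)


# Toy example with made-up transcripts: "ba" and "am" contain a transition,
# while "a" and "aa" do not.
utterances = [["b", "a"], ["a"], ["a", "a"], ["a", "m"]]
proportion = transition_proportion(utterances)
```

Computing this value per child and per age bin would yield trajectories of the kind shown as gray lines in the figure, under the stated assumptions.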
