BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

Théo Charlot, Tarek Kunze, Maxime Poli, Alejandrina Cristia, Emmanuel Dupoux, Marvin Lavechin

TL;DR

BabyHuBERT is a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings spanning 40+ languages. It proves effective on underrepresented languages, and the code and model are shared to support researchers working with child-centered recordings across diverse linguistic contexts.

Abstract

Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly on them because of acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings spanning 40+ languages. We evaluate it on voice type classification (distinguishing target children from female adults, male adults, and other children), a key preprocessing step for analyzing naturalistic language experiences. BabyHuBERT-VTC achieves F1-scores from 52.1% to 74.4% across six corpora, consistently outperforming W2V2-LL4300 (trained on English daylongs) and HuBERT (trained on clean adult speech). Notable gains include 13.2 and 15.9 absolute F1 points over HuBERT on Vanuatu and Solomon Islands, demonstrating effectiveness on underrepresented languages. We share code and model to support researchers working with child-centered recordings across diverse linguistic contexts.
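
To make the reported metric concrete, here is a minimal sketch of a per-class F1 computation for voice type classification. It is illustrative only and is not the authors' evaluation code: the frame-level, multi-label representation, the label names, and the function names `f1_per_class` and `mean_f1` are all assumptions introduced for this example.

```python
from typing import Dict, List, Set

# Hypothetical label names for the four voice types named in the abstract.
LABELS = ["target_child", "other_child", "female_adult", "male_adult"]


def f1_per_class(
    reference: List[Set[str]],
    hypothesis: List[Set[str]],
    labels: List[str],
) -> Dict[str, float]:
    """Compute one F1 score per voice type from frame-level label sets.

    Each frame is a set of active labels, so overlapping speakers are
    allowed (an assumption of this sketch).
    """
    scores: Dict[str, float] = {}
    for label in labels:
        tp = fp = fn = 0
        for ref, hyp in zip(reference, hypothesis):
            in_ref, in_hyp = label in ref, label in hyp
            if in_ref and in_hyp:
                tp += 1  # label correctly detected in this frame
            elif in_hyp:
                fp += 1  # label predicted but absent from the reference
            elif in_ref:
                fn += 1  # label present in the reference but missed
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def mean_f1(scores: Dict[str, float]) -> float:
    """Unweighted average of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)
```

Under this sketch, an "absolute F1 point" gain between two systems would simply be the difference between their scores expressed in percent.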

Paper Structure

This paper contains 12 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Comparison of the performance obtained by BabyHuBERT-VTC, W2V2-LL4300 and HuBERT (all fine-tuned on BabyTrain-2025) across corpora of the test set.