BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Marvin Lavechin, Yaya Sy, Hadrien Titeux, María Andrea Cruz Blandón, Okko Räsänen, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

TL;DR

BabySLM addresses how to evaluate self-supervised spoken language models trained on developmentally plausible input. It introduces a lexical spot-the-word task and a grammatical-acceptability task, uses child-directed training data (Providence, SEEDLingS), and compares a speech-based approach (STELA) with text-based LMs (LSTM variants, BabyBERTa). Results reveal a persistent gap between speech- and text-based models and a pronounced gap between clean audiobook speech and in-the-wild long-form speech, identifying two key hurdles for realistic cognitive modeling. The work provides a practical benchmark to guide future data collection, model design, and cross-language benchmarking on child-directed speech.

Abstract

Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels, both of which are compatible with the vocabulary typical of children's language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.

Paper Structure (13 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Language modeling from text to speech. Top panel shows the lexical accuracy obtained by language models trained on audio (STELA) or phonemes (LSTM). Bottom panel shows the syntactic accuracy obtained by language models trained on audio (STELA) or byte-pair-encoded (BPE) words (LSTM). All models are trained on the Providence corpora in audio, phonetic, or orthographic form. Numbers are computed on the test set. Error bars represent standard errors computed across mutually exclusive training sets.
  • Figure 2: Language modeling from clean to in-the-wild speech. Lexical accuracy obtained by STELA trained on audiobooks (Libri-light, in blue) or child-centered long-forms (SEEDLingS, in orange) as a function of speech quantity. Numbers are computed on the test set. Error bars represent standard errors computed across mutually exclusive training sets.
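Both captions report accuracies with standard errors computed across mutually exclusive training sets. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation code. It assumes that each test item is a pair of model scores (for example, a log-probability for the real word and one for the matched non-word) and that an item counts as correct when the first score is higher. The function names and the example values are hypothetical.

```python
import math
from statistics import mean, stdev


def pairwise_accuracy(score_pairs):
    """Fraction of (preferred, dispreferred) score pairs ranked correctly.

    Assumption: a pair is correct when the model scores the preferred
    item (e.g. the real word) strictly higher than its counterpart.
    """
    if not score_pairs:
        raise ValueError("need at least one score pair")
    correct = sum(1 for preferred, other in score_pairs if preferred > other)
    return correct / len(score_pairs)


def mean_and_standard_error(accuracies):
    """Mean and standard error of accuracies, one per training set.

    The standard error is the sample standard deviation divided by
    the square root of the number of training sets.
    """
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two training sets")
    return mean(accuracies), stdev(accuracies) / math.sqrt(n)


# Hypothetical scores from one model on four test pairs.
pairs = [(-1.0, -2.0), (-3.0, -2.5), (-0.5, -4.0), (-2.0, -2.1)]
accuracy = pairwise_accuracy(pairs)

# Hypothetical accuracies from models trained on three disjoint subsets.
avg, stderr = mean_and_standard_error([0.5, 0.7, 0.6])
```

Treating each disjoint training set as one independent sample is what allows the spread across sets to be summarized as an error bar around the mean accuracy.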