Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Alex Warstadt, Leshem Choshen, Aaron Mueller, Adina Williams, Ethan Wilcox, Chengxu Zhuang

TL;DR

The BabyLM Challenge tackles sample-efficient pretraining on a developmentally plausible corpus, with the aim of studying data-limited language learning and cognitive modeling. It defines three tracks: Strict and Strict-small restrict training to pre-released datasets of 100M and 10M words respectively, while Loose restricts only the amount of text (up to 100M words) and leaves the choice of data open. A Colab-based evaluation pipeline keeps the task accessible to diverse research groups. The paper specifies a child-centric dataset, a standardized evaluation, and baseline configurations (OPT, RoBERTa, T5) for benchmarking data efficiency, while allowing flexible submissions and prioritizing cognitively relevant outcomes. By enabling low-resource experimentation and providing open evaluation, the work aims to advance data-efficient NLP and human language acquisition research on university budgets.

Abstract

We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. This shared task is intended for participants with an interest in small-scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, we provide a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome). We will release a shared evaluation pipeline which scores models on a variety of benchmarks and tasks, including targeted syntactic evaluations and natural language understanding.
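
To make the track limits concrete, here is a minimal sketch of how a participant might check a text corpus against a track's word budget. It is not part of the official pipeline: the names `TRACK_BUDGETS`, `count_words`, and `within_budget` are hypothetical, and counting whitespace-separated tokens is only an assumed stand-in for however the organizers count words.

```python
from typing import Iterable

# Word budgets per track, as described in the call:
# Strict-small uses 10M words; Strict and Loose allow up to 100M.
# The dictionary keys are illustrative labels, not official identifiers.
TRACK_BUDGETS = {
    "strict-small": 10_000_000,
    "strict": 100_000_000,
    "loose": 100_000_000,
}


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens (an assumed proxy for 'words')."""
    return sum(len(line.split()) for line in lines)


def within_budget(lines: Iterable[str], track: str) -> bool:
    """Return True if the corpus does not exceed the budget for `track`."""
    return count_words(lines) <= TRACK_BUDGETS[track]
```

For a corpus stored on disk, the same check could be run by passing an open file handle as `lines`, since iterating over a text file yields one line at a time.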

Paper Structure (15 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: Data Scale: Modern language models are trained on data multiple orders of magnitude larger than the amount available to a typical human child. Image based on Fig. 1 of Warstadt and Bowman (2022).