Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Ryan Cotterell, Leshem Choshen, Alex Warstadt, Ethan Gotlieb Wilcox

TL;DR

The paper reports the findings of the second BabyLM Challenge, which targets data-efficient language modeling under a fixed budget of at most 100 million words and introduces improved text corpora and a vision-and-language corpus. It covers 31 submissions from 17 countries: a hybrid causal-masked model (GPT-BERT) achieved the top performance on text tasks, and training FLOPs showed a strong positive relationship with scores. Text-only tracks saw improvements, but no multimodal submission surpassed the baselines, which highlights the ongoing difficulty of grounding and vision-language integration at small data scales. The paper also identifies impactful directions such as dataset construction, multi-objective training, and tokenization innovations, and provides resources and an evaluation pipeline to support reproducible, community-driven progress toward cognitively plausible, data-efficient language models.

Abstract

The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among other abilities. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track. From 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submissions outperformed the baselines in the multimodal track. In follow-up analyses, we found a strong relationship between training FLOPs and average performance across tasks, and that the best-performing submissions proposed changes to the training data, training objective, and model architecture. This year's BabyLM Challenge shows that there is still significant room for innovation in this setting, in particular for image-text modeling, but community-driven research can yield actionable insights about effective strategies for small-scale language modeling.

Paper Structure

This paper contains 56 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: A breakdown of the various approaches used in the 2024 BabyLM challenge, organized by category and track. Curriculum learning again takes the top spot as the most popular approach, followed by training objective innovations.
  • Figure 2: Overall results: At left, multimodal models on multimodal tasks; at right, all models on text tasks. N.B. Human scores for multimodal evals differ somewhat from how we evaluate our models.
  • Figure 3: The relationship between training FLOPs and final score.
  • Figure 4: Scores aggregated by backbone architecture. Colors indicate different submissions.
  • Figure 5: Scores on the BabyLM challenge, aggregated by approach. Colors indicate different submissions, which are plotted twice if they use more than one approach. Axes are zoomed to show variation in the 45-60 range more clearly.