[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Leshem Choshen; Ryan Cotterell; Michael Y. Hu; Tal Linzen; Aaron Mueller; Candace Ross; Alex Warstadt; Ethan Wilcox; Adina Williams; Chengxu Zhuang

TL;DR

This Call for Papers for the 2nd BabyLM Challenge details the rule changes for 2024/2025, which aim to broaden participation and foster cognitively motivated language modeling under developmentally plausible data constraints. It replaces the loose track with a paper track, allows participants to construct their own pretraining datasets within the 10M-word or 100M-word budget, and adds a vision-and-language track with a corpus of 50% text-only and 50% image-text data, alongside updated data provisions and evaluation tools. The document also outlines evaluation pipelines (catwalk), baselines, submission channels (OpenReview and Dynabench), and a detailed FAQ to guide participants, including datasheets for self-constructed corpora. Together, these provisions give participants a practical framework for studying sample-efficient pretraining and multimodal language modeling on an accessible budget.

Abstract

After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The main changes for this year's competition are as follows. First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM training. The purpose of this CfP is to provide rules for this year's challenge, explain these rule changes and their rationale in greater detail, give a timeline of this year's competition, and provide answers to frequently asked questions from last year's challenge.
