Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways

Venkata S Govindarajan, Juan Diego Rodriguez, Kaj Bostrom, Kyle Mahowald

TL;DR

Lil-Bevo investigates training language models on human-scale data through three strategies: a short-sequence curriculum, music pretraining, and targeted MLM. Using a DeBERTa encoder, the study finds that short-sequence training consistently benefits performance, music pretraining offers small, task-specific gains, and targeted MLM yields selective improvements on certain BLiMP tasks but not a broad uplift. Ablations show that pretraining on longer sequences underperforms pretraining on shorter ones, and that the combined approach (Lil-Bevo) often edges out the short+target baseline yet remains far from matching large LLMs trained on vastly more data. The work suggests that further gains in human-scale language learning may require tighter integration of data types, masking strategies, and curriculum refinements, and it advocates broader collaborative evaluation within BabyLM to identify robust, scalable gains.

Abstract

We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained our masked language models with three ingredients: an initial pretraining with music data, training on shorter sequences before training on longer ones, and masking specific tokens to target some of the BLiMP subtasks. Overall, our baseline models performed above chance, but far below the performance levels of larger LLMs trained on more data. We found that training on short sequences performed better than training on longer sequences. Pretraining on music may help performance marginally, but, if so, the effect seems small. Our targeted Masked Language Modeling augmentation did not seem to improve model performance in general, but did seem to help on some of the specific BLiMP tasks that we were targeting (e.g., Negative Polarity Items). Training performant LLMs on small amounts of data is a difficult but potentially informative task. While some of our techniques showed some promise, more work is needed to explore whether they can improve performance more than the modest gains here. Our code is available at https://github.com/venkatasg/Lil-Bevo and our models at https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a
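
To make the idea of "masking specific tokens" more concrete, here is a minimal sketch of one way targeted masking could work. It is not the authors' implementation: the word list `TARGET_TOKENS`, the function `targeted_mask`, and the probabilities `target_prob` and `base_prob` are all hypothetical names and values chosen for illustration, and whitespace-split words stand in for real subword tokens.

```python
import random

# Hypothetical set of words to mask preferentially (an assumption for this
# sketch; the abstract does not list which tokens were targeted).
TARGET_TOKENS = {"any", "ever", "yet"}


def targeted_mask(tokens, target_tokens=TARGET_TOKENS, target_prob=0.5,
                  base_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask targeted tokens with probability target_prob, others with base_prob.

    Returns (masked_tokens, labels), where labels[i] is the original token
    at each masked position and None everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        # Targeted words get their own (assumed higher) masking probability.
        p = target_prob if tok.lower() in target_tokens else base_prob
        if rng.random() < p:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Deterministic demonstration: always mask targets, never mask other words.
masked, labels = targeted_mask(["She", "has", "not", "ever", "left"],
                               target_prob=1.0, base_prob=0.0)
print(masked)  # ['She', 'has', 'not', '[MASK]', 'left']
```

With the default probabilities the output is random; passing a seeded `random.Random` as `rng` makes a run reproducible.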

Paper Structure (25 sections, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Results for each model, for each task. The color reflects the difference in score between the given model and the RoBERTa baseline results released by the organizers of BabyLM.