Pre-training LLMs using human-like development data corpus

Khushi Bhardwaj, Raj Sanjay Shah, Sashank Varma

TL;DR

The paper investigates cognitively plausible pre-training by restricting the training data to a scale similar to human exposure (around $10^7$ tokens) and evaluating RoBERTa-base, DistilBERT, and GPT-2 on the Strict and Strict-small data with new vocabularies. It systematically explores the effect of the number of training epochs and of hyperparameter settings, and releases Hugging Face checkpoints to enable replication and further study. Key findings show that longer pre-training improves performance on BLiMP and SuperGLUE-like tasks, but sensitivity to initialization remains a challenge that requires warm-up or grid-search tuning. The work demonstrates the viability of human-like data for pre-training, provides baselines and resources, and discusses limitations and ethical considerations relevant to cognitively aligned NLP research.

Abstract

Pre-trained Large Language Models (LLMs) have shown success in a diverse set of language inference and understanding tasks. The pre-training stage of LLMs relies on a large corpus of raw textual data. The BabyLM shared task compares LLM pre-training to human language acquisition, where the number of tokens seen by a 13-year-old child is orders of magnitude smaller than the number of tokens seen by LLMs. In this work, we pre-train and evaluate LLMs on their ability to learn contextual word representations using roughly the same number of tokens as seen by children. We provide a strong set of baselines with different architectures, an evaluation of changes in performance across epochs, and reported pre-training metrics for the strict-small and strict tracks of the task. We also try to loosely replicate the RoBERTa baseline given by the task organizers to observe the robustness of training to hyperparameter selection and its replicability. We provide the submission details for the strict and strict-small tracks in this report.
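
The TL;DR mentions learning-rate warm-up as one remedy for initialization sensitivity. As a rough illustration of what such a schedule can look like, the sketch below implements a linear warm-up followed by a linear decay. The function name, its parameters, and the numeric values are hypothetical and chosen only for illustration; this section does not state the schedule or hyperparameters the authors actually used.

```python
def linear_warmup_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp from 0 to `peak_lr`, then linear decay to 0.

    All arguments are illustrative assumptions, not values taken from the paper.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps < 0 or total_steps <= warmup_steps:
        raise ValueError("need 0 <= warmup_steps < total_steps")

    # Warm-up phase: scale the peak rate by the fraction of warm-up completed.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Decay phase: scale the peak rate by the fraction of post-warm-up steps left.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Hypothetical settings: 100 warm-up steps out of 1000 total steps.
    for s in (0, 50, 100, 550, 1000):
        print(s, linear_warmup_lr(s, peak_lr=1e-4, warmup_steps=100, total_steps=1000))
```

With a schedule of this shape, the first updates after a fresh initialization are small, which is the usual motivation for warm-up when training proves sensitive to the starting point.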

Paper Structure (13 sections, 9 tables)