Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza
TL;DR
The paper re-evaluates whether child-directed language (CDL) provides consistent syntactic gains for language models compared to adult-directed language (ADL) across English, French, and German, using RoBERTa- and GPT-2–style architectures. It introduces FIT-CLAMS, a frequency-controlled minimal-pair evaluation, to separate genuine syntactic generalization from lexical frequency effects. Across benchmarks (BLiMP, Zorro, CLAMS) and languages, CDL often underperforms or shows mixed effects relative to Wikipedia-trained models, with any CDL advantage largely tied to question phenomena. Regression analyses show model performance correlates modestly with lexical frequency, underscoring the importance of frequency control in assessing syntactic learning. The authors advocate integrating CDL insights into cognitively grounded or interactive training paradigms and using CDL to inform inductive biases and data augmentation rather than as a straightforward pretraining resource.
Abstract
Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach syntactic abilities similar to those of LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal-pair evaluations and regression analysis, we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
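
To make the minimal-pair setup concrete, here is a small illustrative sketch, not the authors' implementation. It assumes a scorer that maps a sentence to a log-probability and counts a pair as correct when the grammatical sentence scores higher than the ungrammatical one. The names `train_unigram` and `minimal_pair_accuracy`, the toy corpus, and the add-one-smoothed unigram scorer are all hypothetical. The unigram scorer stands in for a trained LM and is deliberately sensitive only to word frequency.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Build a sentence scorer from add-one-smoothed unigram counts.

    Assumption: this toy scorer stands in for a trained LM. It sees
    only word frequency and has no notion of syntax.
    """
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen tokens

    def score(sentence):
        # Sum of smoothed log-probabilities of the whitespace tokens.
        return sum(
            math.log((counts[tok] + 1) / (total + vocab))
            for tok in sentence.split()
        )

    return score


def minimal_pair_accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    hits = sum(score(good) > score(bad) for good, bad in pairs)
    return hits / len(pairs)


# Toy training corpus in which "bark" is more frequent than "barks".
corpus = ["the dogs bark", "the dogs bark", "the dog barks"]
score = train_unigram(corpus)

# Each pair differs only in the verb form (subject-verb agreement).
pairs = [
    ("the dogs bark", "the dogs barks"),
    ("the dog barks", "the dog bark"),
]
accuracy = minimal_pair_accuracy(pairs, score)
print(accuracy)
```

In this toy setting the frequency-only scorer gets the first pair right and the second wrong, purely because "bark" occurs more often than "barks" in the corpus. This is the kind of confound that a frequency-controlled design is meant to rule out when comparing models trained on different corpora.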
