Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models

Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza

TL;DR

The paper re-evaluates whether child-directed language (CDL) provides consistent syntactic gains for language models compared to adult-directed language (ADL) across English, French, and German, using RoBERTa- and GPT-2-style architectures. It introduces FIT-CLAMS, a frequency-controlled minimal-pair evaluation, to separate genuine syntactic generalization from lexical frequency effects. Across benchmarks (BLiMP, Zorro, CLAMS) and languages, CDL often underperforms or shows mixed effects relative to Wikipedia-trained models, with any CDL advantage largely tied to question phenomena. Regression analyses show model performance correlates modestly with lexical frequency, underscoring the importance of frequency control in assessing syntactic learning. The authors advocate integrating CDL insights into cognitively grounded or interactive training paradigms and using CDL to inform inductive biases and data augmentation rather than as a straightforward pretraining resource.

Abstract

Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach syntactic abilities similar to those of LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal-pair evaluations and regression analysis, we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.
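
To make the minimal-pair evaluation mentioned in the abstract concrete, here is a small Python sketch of how accuracy on such a benchmark is typically computed: a pair counts as correct when the model scores the grammatical sentence higher than its ungrammatical counterpart. Everything named here is an assumption for illustration. `minimal_pair_accuracy`, `toy_score`, the bigram table, and the example sentences are hypothetical and are not taken from the paper; in practice the scorer would be a trained LM's sentence (pseudo-)log-probability.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]  # (grammatical sentence, ungrammatical sentence)


def minimal_pair_accuracy(pairs: Iterable[Pair],
                          score: Callable[[str], float]) -> float:
    """Fraction of pairs where the grammatical sentence scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical stand-in for an LM: a tiny table of bigram log-probabilities.
# A real evaluation would query a trained masked or causal LM instead.
TOY_BIGRAM_LOGPROB = {
    ("author", "laughs"): -1.0,
    ("authors", "laugh"): -1.0,
}
UNSEEN_LOGPROB = -5.0  # assumed penalty for bigrams missing from the table


def toy_score(sentence: str) -> float:
    """Sum toy bigram log-probabilities over a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return sum(TOY_BIGRAM_LOGPROB.get(bigram, UNSEEN_LOGPROB)
               for bigram in zip(tokens, tokens[1:]))


# Illustrative subject-verb agreement pairs (invented for this sketch).
PAIRS = [
    ("the author laughs", "the author laugh"),
    ("the authors laugh", "the authors laughs"),
    ("the pilot smiles", "the pilot smile"),  # both verb forms unseen by the toy scorer
]

if __name__ == "__main__":
    print(f"accuracy = {minimal_pair_accuracy(PAIRS, toy_score):.3f}")
```

The third pair shows why lexical frequency matters in this setting: the toy scorer has never seen either verb form, so it cannot prefer the grammatical sentence and the pair counts as an error. This is the kind of confound that a frequency-controlled design is intended to address.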

Paper Structure

This paper contains 20 sections, 1 equation, 13 figures, 9 tables.

Figures (13)

  • Figure 1: CHILDES age distribution across languages.
  • Figure 2: Accuracy of our models on the individual paradigms in the new set of minimal pairs, FIT-CLAMS.
  • Figure 3: Relation between LM accuracy on FIT-CLAMS and proportion of variance ($R^2$) explained by the OLS regression fitted on lexical frequency factors. The lower the $R^2$ is, the less the LM's behavior is driven by lexical frequency. Each LM configuration is represented by four data points: three individual LMs (random seeds) and the average of the three (highlighted with black outline).
  • Figure 4: Word Frequency Distribution (CHILDES vs Wikipedia) across languages.
  • Figure 5: Sentence Length Distribution (CHILDES vs Wikipedia) across languages and data types.
  • ...and 8 more figures