A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models

Xiulin Yang; Arianna Bisazza; Nathan Schneider; Ethan Gotlieb Wilcox

TL;DR

This paper tests the Poverty of the Stimulus (PoS) claim for neural language learners using PoSH-Bench, a developmentally plausible training/evaluation suite targeting key PoS phenomena. Using GPT-2 variants trained on 10–50M words with manipulated direct evidence, the authors show that transformers can generalize above chance even without direct positive evidence, but remain less data-efficient than children. They further test three cognitively motivated inductive biases and find that while these biases improve general syntactic competence, they do not close the PoS-specific efficiency gap. The findings challenge the view that innate, language-specific constraints are the sole route to robust generalization and suggest that human-like data efficiency may require additional mechanisms, possibly multimodal or caregiver-mediated signals.

Abstract

How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce PoSH-Bench, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10–50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence, yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not PoSH-Bench performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.

Paper Structure (56 sections, 1 equation, 5 figures, 13 tables)

Introduction
Background & Related Work
The Heart of the Learnability Debate
PoS Phenomena Studied
Yes/No Question Formation (QF)
Definition & Human Evidence
Computational Modeling
Island Constraints
Definition & Human Evidence
Computational Modeling
Binding
Definition & Human Evidence
Computational Modeling
Wanna-Contraction
Definition & Human Evidence
...and 41 more sections

Figures (5)

Figure 1: Performance of human vs. transformer learners with roughly the same amount of input, where model training sizes (10M, 30M) are aligned with the estimated cumulative input of children at specific developmental stages. Note: While differences in evaluation protocols between behavioral studies and model evaluations preclude direct quantitative comparison, the contrasting trajectories highlight that children acquire most of these phenomena with greater data efficiency at these limited input scales.
Figure 2: Category-wise average on Poverty of the Stimulus phenomena for each benchmark. The horizontal dashed line represents chance-level performance. The shaded band/error bar represents the SD across 3 random seeds.
Figure 3: Bayesian modeling results showing effects of filtering, training corpus, data size, and PoS categories. Positive effect indicates higher scores.
Figure 4: Results of GPT-2 mini with/without different inductive biases. Top: Per-category breakdown. Bottom: Benchmark overall averages. Shaded: ±1 SD across 3 random seeds. Solid darker line: Vanilla transformer baseline without any modifications.
Figure 5: Performance of different models on selected PoS island and binding phenomena in each benchmark (performance is averaged over three random seeds).
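
The captions above report scores as a mean with an SD band across 3 random seeds, read against a chance-level line. The sketch below illustrates that arithmetic only; it is not the authors' analysis code. The chance level of 0.5, the use of the sample SD, the "mean minus one SD" heuristic, and all function names and scores are assumptions made for illustration.

```python
from statistics import mean, stdev

# Assumption: a two-way forced choice, so chance is 0.5.
# The source only says a dashed line marks chance-level performance.
CHANCE = 0.5


def summarize(scores_by_seed):
    """Return (mean, SD) of per-seed accuracies.

    Uses the sample SD (n - 1 denominator); the source does not say
    which SD variant the figures use.
    """
    return mean(scores_by_seed), stdev(scores_by_seed)


def band_above_chance(scores_by_seed, chance=CHANCE):
    """Heuristic check: is the lower edge of the +/-1 SD band above chance?

    This is a visual-reading heuristic, not a significance test.
    """
    m, sd = summarize(scores_by_seed)
    return m - sd > chance


if __name__ == "__main__":
    # Hypothetical accuracies for one category across 3 seeds.
    seeds = [0.6, 0.7, 0.8]
    m, sd = summarize(seeds)
    print(f"mean={m:.3f} sd={sd:.3f} above_chance={band_above_chance(seeds)}")
```

With only three seeds the SD is itself a noisy estimate, so a band that clears the chance line is weak evidence on its own.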
