A systematic investigation of learnability from single child linguistic input
Yulu Qin, Wentao Wang, Brenden M. Lake
TL;DR
The paper investigates what linguistic knowledge a model can acquire when it is trained only on the limited linguistic input a single child encounters. It extends prior work by evaluating six architectures on five datasets (three single-child corpora plus two baselines) and by using several evaluation methods, including the Zorro grammaticality suite, embedding visualizations, and cloze tests. Across configurations, the study finds that syntactic and semantic structure emerges robustly and that the models show selective sensitivity to linguistic phenomena, consistent with prior single-child work. The results suggest that a limited amount of child-directed input can support meaningful linguistic representations across architectures, with implications for cognitive modeling and the realism of language-learning simulations. The paper also notes limitations and directions for future multimodal research.
Abstract
Language models (LMs) have demonstrated remarkable proficiency in generating linguistically coherent text, sparking discussions about their relevance to understanding human language learnability. However, a significant gap exists between the training data for these models and the linguistic input a child receives. LMs are typically trained on data that is orders of magnitude larger than, and fundamentally different from, child-directed speech (Warstadt and Bowman, 2022; Warstadt et al., 2023; Frank, 2023a). To address this discrepancy, our research focuses on training LMs on subsets of a single child's linguistic input. Previously, Wang, Vong, Kim, and Lake (2023) found that LMs trained in this setting can form syntactic and semantic word clusters and develop sensitivity to certain linguistic phenomena, but they considered only LSTMs and simpler neural networks trained on a single single-child dataset. Here, to examine the robustness of learnability from single-child input, we systematically train six different model architectures on five datasets (three single-child and two baselines). We find that the models trained on single-child datasets yield consistent results that match previous work, underscoring the robustness of forming meaningful syntactic and semantic representations from a subset of a child's linguistic input.
