Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction
James A. Michaelov, Catherine Arnett
TL;DR
The paper investigates how language models acquire grammatical knowledge, focusing on subject-verb agreement and attraction effects, and shows that aggregate accuracy can mask key intermediate learning dynamics. It introduces a psycholinguistics-inspired disaggregation method that analyzes performance across data subsets and over training time, using the PolyPythia model suite and log-probability metrics. The results reveal an $n$-gram-like progression: models begin with unigram frequency biases, then incorporate local context and show attraction effects, and finally generalize more broadly as dependencies lengthen, supporting the notion of 'hidden breakthroughs' in grammatical learning. The approach provides interpretable diagnostics for training dynamics and informs how grammatical generalization benchmarks should be evaluated beyond aggregate metrics.
Abstract
Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
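As an illustration of the disaggregation idea described above, the following is a minimal sketch and not the authors' implementation. It assumes each evaluation item has already been scored with a log-probability for the grammatical and the ungrammatical verb form at a given training checkpoint. The field names (`step`, `condition`, `logp_grammatical`, `logp_ungrammatical`), the condition labels (`match`, `mismatch`), and all numeric values are hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical toy scores: one record per (checkpoint, item).
# "match"/"mismatch" are illustrative labels for whether an intervening
# noun agrees in number with the subject; values are invented.
TOY_RECORDS = [
    {"step": 1000, "condition": "match", "logp_grammatical": -2.0, "logp_ungrammatical": -5.0},
    {"step": 1000, "condition": "match", "logp_grammatical": -3.0, "logp_ungrammatical": -4.0},
    {"step": 1000, "condition": "mismatch", "logp_grammatical": -6.0, "logp_ungrammatical": -3.0},
    {"step": 1000, "condition": "mismatch", "logp_grammatical": -5.0, "logp_ungrammatical": -4.0},
    {"step": 10000, "condition": "match", "logp_grammatical": -1.0, "logp_ungrammatical": -6.0},
    {"step": 10000, "condition": "match", "logp_grammatical": -2.0, "logp_ungrammatical": -5.0},
    {"step": 10000, "condition": "mismatch", "logp_grammatical": -3.0, "logp_ungrammatical": -4.0},
    {"step": 10000, "condition": "mismatch", "logp_grammatical": -5.0, "logp_ungrammatical": -4.0},
]


def _is_correct(record):
    # An item counts as correct when the grammatical verb form
    # receives the higher log-probability.
    return record["logp_grammatical"] > record["logp_ungrammatical"]


def disaggregated_accuracy(records):
    """Return {step: {condition: accuracy}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for record in records:
        cell = counts[record["step"]][record["condition"]]
        cell[0] += int(_is_correct(record))
        cell[1] += 1
    return {
        step: {cond: correct / total for cond, (correct, total) in conds.items()}
        for step, conds in counts.items()
    }


def aggregate_accuracy(records):
    """Return {step: accuracy}, pooling all conditions together."""
    counts = defaultdict(lambda: [0, 0])  # [correct, total]
    for record in records:
        counts[record["step"]][0] += int(_is_correct(record))
        counts[record["step"]][1] += 1
    return {step: correct / total for step, (correct, total) in counts.items()}


if __name__ == "__main__":
    pooled = aggregate_accuracy(TOY_RECORDS)
    by_condition = disaggregated_accuracy(TOY_RECORDS)
    for step in sorted(pooled):
        print(step, "aggregate:", pooled[step], "by condition:", by_condition[step])
```

In this toy data the pooled accuracy at the earlier checkpoint sits at chance, while the per-condition view shows the model succeeding on every `match` item and failing on every `mismatch` item, which is the kind of pattern that an aggregate metric alone would hide.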
