Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?
Arzu Burcu Güven, Anna Rogers, Rob van der Goot
TL;DR
This work investigates whether developmentally motivated curriculum learning can leverage syntactic information to improve language-model training. It introduces a syntax-based labeling toolkit using ~300 Tregex patterns to categorize sentences into 13 Grambank-aligned categories, achieving about 71% coverage on CHILDES, and applies this to BabyLM data. The authors find no strong age-based syntactic differentiation in CHILDES, and show that training on syntactically categorized data can match a baseline with about 40% fewer training steps, with mixed results on cross-construction generalization and some gains on reading-related tasks. Overall, the results suggest that data quality and targeted syntactic signal may be more impactful than curriculum design alone, and they provide an open-source toolkit to enable further, more controlled investigations into syntactic curriculum learning for language models.
Abstract
We examine the syntactic properties of the BabyLM corpus and of age groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data rather than the full noisy corpus.
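
To make the idea of a "syntactically categorizable subset" concrete, the following is a minimal sketch of labeling sentences and measuring coverage. It is not the authors' toolkit: the real system matches Tregex patterns against parse trees, whereas this toy version uses surface-level regular expressions as a stand-in. The category names, the patterns, and the helper functions `label_sentence` and `split_by_coverage` are all assumptions made for illustration.

```python
import re

# Hypothetical surface-level stand-ins for syntactic patterns. The toolkit
# described above uses Tregex patterns over parse trees; these regexes and
# category names are illustrative assumptions only.
TOY_PATTERNS = {
    "polar_question": re.compile(
        r"^(do|does|did|is|are|can|will)\b.*\?$", re.IGNORECASE
    ),
    "wh_question": re.compile(
        r"^(what|where|who|why|when|how)\b.*\?$", re.IGNORECASE
    ),
    "negation": re.compile(r"\b(not|never)\b|n't\b", re.IGNORECASE),
}


def label_sentence(sentence):
    """Return the set of toy category names whose pattern matches."""
    text = sentence.strip()
    return {
        name for name, pattern in TOY_PATTERNS.items() if pattern.search(text)
    }


def split_by_coverage(sentences):
    """Split sentences into (labeled, unlabeled) and compute coverage.

    Coverage is the fraction of sentences that received at least one label.
    """
    labeled, unlabeled = [], []
    for sentence in sentences:
        categories = label_sentence(sentence)
        target = labeled if categories else unlabeled
        target.append((sentence, categories))
    coverage = len(labeled) / len(sentences) if sentences else 0.0
    return labeled, unlabeled, coverage


if __name__ == "__main__":
    sample = [
        "Do you want milk?",
        "Where is the ball?",
        "I don't know.",
        "Yeah.",
    ]
    labeled, unlabeled, coverage = split_by_coverage(sample)
    print(f"coverage: {coverage:.2f}")
    for sentence, categories in labeled:
        print(sentence, sorted(categories))
```

Under this sketch, the `labeled` list plays the role of the categorizable subset that could be used for training, while `unlabeled` corresponds to the remainder of the corpus that no pattern matched.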
