Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?

Arzu Burcu Güven, Anna Rogers, Rob van der Goot

TL;DR

This work investigates whether developmentally motivated curriculum learning can leverage syntactic information to improve language-model training. It introduces a syntax-based labeling toolkit using ~300 Tregex patterns to categorize sentences into 13 Grambank-aligned categories, achieving about 71% coverage on CHILDES, and applies this to BabyLM data. The authors find no strong age-based syntactic differentiation in CHILDES, and show that training on syntactically categorized data can match a baseline with about 40% fewer training steps, with mixed results on cross-construction generalization and some gains on reading-related tasks. Overall, the results suggest that data quality and targeted syntactic signal may be more impactful than curriculum design alone, and they provide an open-source toolkit to enable further, more controlled investigations into syntactic curriculum learning for language models.

Abstract

We examine the syntactic properties of the BabyLM corpus and of age groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data rather than the full noisy corpus.

Paper Structure

This paper contains 19 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Constituency parse of the sentence "My feet are dry because I have boots."
  • Figure 2: Tregex Patterns needed to match the sentence "My feet are dry because I have boots."
  • Figure 4: Distribution of macro-categories across corpora. Y-axis shows the percentage of sentences in each macro-category relative to the total number of sentences in the corpus.
  • Figure 5: Distribution of macro-categories across age-ordered CHILDES. X-axis: age groups; Y-axis: percentage of sentences per macro-category.
  • Figure 6: Cross-subset validation perplexity heatmap. Rows = training subset; columns = evaluation subset. Abbreviations: S=SVX, M=Modifiers, V=Verbal, E=Embedded, I=Infinitives, L=Coordination, R=Relative, Q=Question. Cell values are validation perplexities (lower is better).
  • ...and 1 more figure