Unraveling Syntax: How Language Models Learn Context-Free Grammars

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

TL;DR

The paper studies how language models learn syntax, using synthetic PCFGs as a controlled testbed, and derives a theoretical framework in which the training loss $D_{KL}(P_G \| Q_\theta)$ decomposes additively over subgrammars. It shows empirically that the losses of all subgrammars decrease in parallel during training, rather than in a strict developmental order, and demonstrates that pretraining on subgrammars can help small models while larger models are less affected. Activation-space analyses via Centered Kernel Alignment (CKA) indicate that pretrained models develop representations more closely aligned with the subgrammar structure. The work highlights fundamental challenges in representing deep recursive syntax and introduces an open-source CFG-based learning framework to probe and extend the theory of language-learning dynamics in neural networks.

Abstract

We introduce a new framework for understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts with the observation that most domains of interest, such as natural language syntax, coding languages, and arithmetic problems, are captured by probabilistic context-free grammars (PCFGs). We study the learning dynamics of small models trained on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. We prove several general, recursive formulae for the training loss and Kullback-Leibler divergence over the subgrammar structure of a PCFG. Empirically, we find that, unlike children, who first master simple substructures before progressing to more complex constructions, transformers reduce loss across all subgrammars in parallel. We further show that subgrammar pretraining can improve the final loss for smaller models, and that pretrained models develop internal representations more aligned with the grammar's substructure. Finally, we demonstrate that models struggle with deeper recursive structures (a limitation even of large language models), revealing fundamental challenges in how neural networks represent hierarchical syntax. Overall, our work initiates the study of the learning dynamics of transformers on PCFGs as a versatile testbed for probing learning in language models, opening a research direction with many open questions.
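
To make the idea of a synthetic language generated from a PCFG concrete, here is a minimal Python sketch of top-down sampling. The grammar `TOY_PCFG`, the helper `sample_tokens`, and the recursion probability of 0.3 are hypothetical choices made for illustration; they are not taken from the paper or from its open-source framework.

```python
import random

# Hypothetical toy PCFG (not one of the paper's grammars).
# Each nonterminal maps to a list of (probability, right-hand side) pairs;
# any symbol that is not a key of the dict is treated as a terminal.
TOY_PCFG = {
    "S": [(0.7, ["a", "B"]), (0.3, ["a", "S", "b"])],  # second rule recurses on S
    "B": [(1.0, ["b"])],
}

def sample_tokens(symbol, grammar, rng, depth=0, max_depth=200):
    """Expand `symbol` top-down and return the generated list of terminals."""
    if symbol not in grammar:  # terminal symbol: emit it as-is
        return [symbol]
    if depth > max_depth:  # guard against extremely deep derivations
        raise ValueError("derivation exceeded max_depth")
    rules = grammar[symbol]
    # Pick one production according to the rule probabilities.
    _, rhs = rng.choices(rules, weights=[p for p, _ in rules], k=1)[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample_tokens(child, grammar, rng, depth + 1, max_depth))
    return tokens

rng = random.Random(0)  # fixed seed so the corpus is reproducible
corpus = [" ".join(sample_tokens("S", TOY_PCFG, rng)) for _ in range(5)]
for sentence in corpus:
    print(sentence)
```

Every sampled string has the form `a^n b^n`, and raising the probability of the recursive rule shifts mass toward deeper nestings. In this sketch that single number plays the role of the recursion-depth control mentioned in the abstract.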

Paper Structure

This paper contains 22 sections, 8 theorems, 26 equations, 7 figures, 2 tables.

Key Result

Proposition 3.9

Given a true distribution $P$ and model $Q_\theta$ parametrized by $\theta$,

Figures (7)

  • Figure 1: KL-divergence decomposition in a two-layer Transformer. Grammar definitions are given in the appendix.
  • Figure 2: Two examples of the learning behavior of subgrammars.
  • Figure 3: LM error vs. longer context, with or without recursion.
  • Figure 4: Two-layer Transformer showing the impact of the probability of recursion.
  • Figure 5: Examples of pretraining on differently placed subgrammars using ABC Grammar.
  • ...and 2 more figures

Theorems & Definitions (22)

  • Definition 3.1: CFG
  • Definition 3.2: PCFG
  • Definition 3.3: Inner Subgrammar
  • Definition 3.4: Proper Subgrammar
  • Definition 3.5: Outer Subgrammar
  • Definition 3.6: Kullback-Leibler Divergence
  • Definition 3.7: Maximum Likelihood Estimation
  • Definition 3.8: Shannon Entropy
  • Proposition 3.9
  • Theorem 4.1: Unique decomposition of PCFG into inner subgrammars
  • ...and 12 more
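
Definitions 3.6 to 3.8 (Kullback-Leibler divergence, maximum likelihood estimation, Shannon entropy) are linked by a standard identity: the expected negative log-likelihood of a model $Q$ under the true distribution $P$ equals $H(P) + D_{KL}(P \| Q)$. Since $H(P)$ does not depend on the model parameters, minimizing the expected negative log-likelihood is the same as minimizing the KL divergence. The Python sketch below checks this numerically on two small hypothetical distributions. It illustrates the textbook identity only and is not claimed to be the exact statement of Proposition 3.9.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)

def cross_entropy(p, q):
    """Expected negative log-likelihood of Q under P, in nats."""
    return -sum(pi * math.log(q[x]) for x, pi in p.items() if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(pi * math.log(pi / q[x]) for x, pi in p.items() if pi > 0)

# Hypothetical distributions over three strings (illustration only).
P = {"ab": 0.70, "aabb": 0.21, "aaabbb": 0.09}  # stands in for the true distribution
Q = {"ab": 0.50, "aabb": 0.30, "aaabbb": 0.20}  # stands in for a model's distribution

# The gap between the two sides of the identity should vanish up to rounding.
gap = cross_entropy(P, Q) - (entropy(P) + kl_divergence(P, Q))
print(f"cross-entropy      = {cross_entropy(P, Q):.6f}")
print(f"entropy + KL       = {entropy(P) + kl_divergence(P, Q):.6f}")
print(f"difference (~zero) = {gap:.2e}")
```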