Deep networks learn to parse uniform-depth context-free languages from local statistics

Jack T. Parley; Francesco Cagnetta; Matthieu Wyart

TL;DR

This work investigates how deep networks acquire hierarchical parse representations from local sentence statistics, using a tunable varying-tree Random Hierarchy Model (RHM) of uniform-depth PCFGs. It introduces a moments-based learning algorithm that links learnability and sample complexity to language statistics via root-to-pair and root-to-triple covariances, and shows a phase-transition-like crossover from local to global ambiguity at a critical rule fraction $f_c=3/8$. The authors derive a finite-sample complexity bound $P^*=O((p_2^2/2)^{1-L}\, v\, m_3\, m_2^{L-1})$ and validate the predictions across CNN, INN, and transformer architectures, showing robust scaling and accurate recovery of grammar rules in the low-ambiguity regime. The study provides a principled explanation for how deep nets extract abstract, syntax-invariant representations from locally correlated signals, with implications for understanding next-token prediction and the data requirements for learning hierarchical structure.
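As a rough illustration of how the bound quoted above scales with depth, the sketch below evaluates the expression inside the $O(\cdot)$ with the hidden constant set to 1. The function name `sample_complexity_scale` and its argument names are hypothetical, and treating $p_2$ as a positive number is an assumption; nothing here reproduces the authors' code.

```python
def sample_complexity_scale(p2: float, v: int, m2: float, m3: float, L: int) -> float:
    """Evaluate (p2^2 / 2)^(1 - L) * v * m3 * m2^(L - 1).

    This is only the expression inside the O(.) of the TL;DR bound; the
    unknown constant prefactor is assumed to be 1 for illustration.
    """
    if p2 <= 0.0:
        # Assumption: p2 is a strictly positive quantity (e.g. a probability).
        raise ValueError("p2 must be positive")
    if L < 1:
        raise ValueError("depth L must be at least 1")
    return (p2 ** 2 / 2.0) ** (1 - L) * v * m3 * m2 ** (L - 1)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the growth with depth.
    for depth in (1, 2, 3):
        print(depth, sample_complexity_scale(p2=0.5, v=8, m2=4, m3=32, L=depth))
```

Under this expression, each additional level of depth multiplies the estimate by $2 m_2 / p_2^2$, so the predicted data requirement grows exponentially in $L$.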

Abstract

Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks, or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism, an inference algorithm inspired by the structure of deep convolutional networks, that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

Paper Structure (49 sections, 2 theorems, 97 equations, 24 figures)

Introduction
Our contribution
Previous works
Varying-tree Random Hierarchy Model
Generative Model
Controlling the level of global ambiguity
Learning Algorithm and Sample Complexity
Root-to-pair covariance determines binary rules
Whitened root-to-triple covariance determines ternary rules
Candidate nonterminals and covariance with the root
Sample Complexity
Empirical measurements of $P^*$
Tests on CNNs
Other architectures
Conclusion
...and 34 more sections

Key Result

Lemma 3.1

In the limit $v\to\infty$ with $f$ fixed and $m_s\,{=}\,fv^{s-1}$ for $s\,{=}\,2,3$, the marginal probability of a single hidden (nonterminal) or visible (terminal) symbol converges to the uniform probability $1/v$.
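To make the scaling in the lemma concrete, the sketch below computes the rule counts $m_2 = fv$ and $m_3 = fv^2$ for a given $f$ and $v$. The helper names `rule_counts` and `grammatical_fraction` are hypothetical, rounding to the nearest integer is an added assumption, and reading $f$ as the fraction of grammatical tuples assumes that each of the $v$ nonterminals owns $m_s$ distinct rules of size $s$, with no tuple shared between nonterminals.

```python
def rule_counts(f: float, v: int) -> tuple[int, int]:
    """Return (m_2, m_3) with m_s = f * v**(s - 1) for s = 2, 3.

    Rounding to the nearest integer is an assumption made here so that
    the counts are usable as numbers of rules.
    """
    if not 0.0 < f <= 1.0:
        raise ValueError("f is treated as a fraction in (0, 1]")
    m2 = round(f * v)
    m3 = round(f * v ** 2)
    return m2, m3


def grammatical_fraction(m_s: int, v: int, s: int) -> float:
    """Fraction of the v**s possible s-tuples that appear in some rule.

    Assumes v nonterminals with m_s rules each and no shared tuples.
    """
    return v * m_s / v ** s


if __name__ == "__main__":
    m2, m3 = rule_counts(f=0.5, v=8)
    print(m2, m3, grammatical_fraction(m2, 8, 2), grammatical_fraction(m3, 8, 3))
```

With these conventions, holding $f$ fixed while $v$ grows keeps the same fraction of pairs and triples grammatical at every vocabulary size.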

Figures (24)

Figure 1: (a) Test cross-entropy loss $\overline{\mathcal{L}}$ (normalized by random guessing) of deep CNNs as a function of the number of training sentences $P$, generated from a varying-tree PCFG of uniform depth $L$ and vocabulary size $v$. (b) Remarkably, the curves collapse when $P$ is rescaled by our theoretical prediction for the sample complexity. $m_2$ and $m_3$ characterize the number of production rules (see main text); here $m_2$ is constant and $m_3\propto v$.
Figure 2: Left: local vs. global ambiguity (both lexical and syntactic) in natural language. In this "garden-path" sentence (Frazier & Rayner, 1982), the reader is pushed by habit into reading "the complex houses" as a noun phrase (NP); pursuing this, however, leads to a dead end (an incomplete parse), so the sentence remains globally unambiguous. Right: example of global ambiguity in a varying-tree RHM of depth $L=2$, where the sentence "efddf" has two associated parse trees/class labels. Shown below are the production rules used in each case to derive the sentence.
Figure 3: Crossover from local to global ambiguity as the fraction $f$ of grammatical rules is tuned. Shown is the expected class entropy given a sentence $\mathop{\mathbb{E}}[H_L(\alpha|x)]$ (normalized by $\ln (v)$), which we compute numerically by running the parallelized inside algorithm and averaging over both sentences and grammar realizations. Red dashed lines indicate the theoretical estimate $f_c = 3/8$ (App. \ref{app:bottom_up}), while black dashed lines show $\mathop{\mathbb{E}}[H_{L=2}(\alpha|x)]$, which can be computed exactly (App. \ref{app:top_down}).
Figure 4: Illustration of the inside algorithm and the appearance of spurious nonterminals. Depicted are the inside tensors $M^{(l)}_{i,\lambda}(z)$ created while parsing a sentence of length $d{=}11$ and depth $L{=}3$. The third axis corresponding to nonterminals is not shown explicitly; instead we only distinguish between nonterminals belonging to the true parse tree (circles) and spurious ones (crosses). Spurious candidates can be built from purely spurious/true nonterminals in the level below (green/purple) or from a mixture (orange).
Figure 5: Decomposition of root-to-pair and root-to-triple covariances by the law of total covariance according to the local topology. Diagrams i) to iv) contribute to the root-to-pair covariance, while diagrams v), vi) and vii) contribute to the root-to-triple one.
...and 19 more figures
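The normalized class entropy plotted in Figure 3 can be illustrated for a single sentence as follows. This is a minimal sketch, assuming the posterior over the $v$ class labels $p(\alpha|x)$ is already available (for example from an inside-algorithm pass, which is not implemented here); the function name `normalized_class_entropy` is hypothetical, and the averaging over sentences and grammar realizations described in the caption is left out.

```python
import math
from typing import Sequence


def normalized_class_entropy(posterior: Sequence[float], v: int) -> float:
    """Return H(alpha | x) / ln(v) for one sentence x.

    `posterior` holds p(alpha | x) over the class labels; zero entries
    contribute nothing to the entropy (0 * ln 0 is taken as 0).
    """
    if v < 2:
        raise ValueError("v must be at least 2 so that ln(v) > 0")
    if abs(sum(posterior) - 1.0) > 1e-9:
        raise ValueError("posterior must sum to 1")
    entropy = -sum(p * math.log(p) for p in posterior if p > 0.0)
    return entropy / math.log(v)


if __name__ == "__main__":
    # One compatible class (unambiguous) versus all classes equally likely.
    print(normalized_class_entropy([1.0, 0.0, 0.0, 0.0], v=4))
    print(normalized_class_entropy([0.25, 0.25, 0.25, 0.25], v=4))
```

A value near 0 means the sentence determines its class, while a value near 1 means the sentence carries almost no information about the class.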

Theorems & Definitions (2)

Lemma 3.1: Asymptotic symbol marginals
Lemma 3.2: Asymptotic root-to-terminal covariance
