Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations

Ananth Agarwal, Jasper Jian, Christopher D. Manning, Shikhar Murty

TL;DR

This work interrogates whether syntactic information revealed by linear probing truly explains a model's downstream syntactic behavior. By evaluating 32 open-weight transformer models with three syntax probes and a control, and by testing against BLiMP minimal-pair judgments, the authors find a persistent dissociation: probing accuracy does not reliably predict performance on targeted phenomena such as subject–verb agreement or filler–gap dependencies. The study introduces a control task to isolate lexical signals and reveals that even non-syntactic probes can correlate with some syntactic benchmarks, urging caution in interpreting probing results as explanations of model behavior. The results advocate using external targeted evaluations (BLiMP-style tasks) as the gold standard for syntactic competence and encourage multilingual extensions to understand language-specific probing dynamics.

Abstract

Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations; however, no comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance. Adopting a "mechanisms vs. outcomes" framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.

Paper Structure

This paper contains 40 sections, 11 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Mechanisms vs. outcomes setup. We find no convincing predictive power of syntax probing accuracy on downstream syntactic evaluation accuracy. (a) Mechanism: probe $g_\phi$ extracts a dependency parse tree from $S_{acc}$ word-level hidden states $h_i$; UUAS = 4/6 edges in this toy example. (b) Outcome: evaluate minimal pair accuracy.
  • Figure 2: Penn Treebank test set $g^\text{struct}_{\phi}$ UUAS for each layer of a sample of our models. The star icon for a model indicates the layer with the best test set accuracy that is used for BLiMP evaluation. For most models, this occurs in the first half. Our results for GPT-2 124M and GPT-2 1.6B closely match those of Eisape et al. (2022). For T5 3B, although most layers yield low UUAS, the best layer (13) still exceeds 0.7.
  • Figure 3: Control probes consistently erase contextual information in hidden states, as evidenced by near-zero variance of word contextual hidden states in the projected representation space. Control probes shown here are trained on structural probe best layers.
  • Figure 4: Simple regression plots for $g^\text{struct}_{\phi}$ (top row), $g^\text{ortho}_{\phi}$ (middle row), and $g^\text{head}_{\phi}$ (bottom row). Each panel is annotated with adjusted $R^2$ and the $p$-value of $\beta_1$. Per-phenomenon results have Holm-Bonferroni correction. The first column of panels shows that at the full dataset granularity, no probe explains the spread in minimal pairs accuracy with any statistical significance. At per-phenomenon granularity, the second column contains the phenomenon with the lowest $p$-value per probe. We additionally highlight subject--verb agreement (third column) and filler--gap (fourth column) as strongly syntactic tasks with critical edges that we identify in Appendix \ref{sec:critical-edges}.
  • Figure 5: Control $g^\text{ctrl}_{\phi}$ (trained on $g^\text{struct}_{\phi}$ best layers) unexpectedly achieves statistical significance for predicting the BLiMP irregular forms phenomenon, which is a morphosyntactic task. Tables \ref{tab:simple_ctrl_regression} and \ref{tab:simple_ctrl_regression_head} in Appendix \ref{sec:full-tables} have full control simple regression results.
  • ...and 11 more figures
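
The two quantities contrasted in Figure 1 can be illustrated with a minimal sketch. It assumes UUAS is the fraction of gold undirected dependency edges that the predicted tree recovers, and that a minimal pair counts as correct when the model scores the acceptable sentence strictly higher than the unacceptable one. The function names, the toy edges, and the scores below are hypothetical and are not taken from the paper's code or data.

```python
from typing import FrozenSet, Iterable, Sequence, Set, Tuple


def undirected(edges: Iterable[Tuple[int, int]]) -> Set[FrozenSet[int]]:
    """Drop edge direction so that (head, dep) and (dep, head) compare equal."""
    return {frozenset(edge) for edge in edges}


def uuas(gold_edges: Iterable[Tuple[int, int]],
         pred_edges: Iterable[Tuple[int, int]]) -> float:
    """Undirected unlabeled attachment score for one sentence."""
    gold = undirected(gold_edges)
    pred = undirected(pred_edges)
    if not gold:
        raise ValueError("gold tree has no edges")
    return len(gold & pred) / len(gold)


def minimal_pair_accuracy(scores: Sequence[Tuple[float, float]]) -> float:
    """Share of pairs where the acceptable sentence outscores the unacceptable one.

    Each item is (score_acceptable, score_unacceptable), for example
    sentence log-probabilities under the model being evaluated.
    """
    if not scores:
        raise ValueError("no minimal pairs given")
    correct = sum(1 for good, bad in scores if good > bad)
    return correct / len(scores)


# Hypothetical 7-word sentence: 6 gold edges, 4 of them recovered by the probe.
gold_tree = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
pred_tree = [(1, 0), (1, 2), (2, 3), (3, 4), (3, 5), (4, 6)]
toy_uuas = uuas(gold_tree, pred_tree)  # 4 shared edges out of 6

# Hypothetical log-probabilities for three minimal pairs; the second is misjudged.
pair_scores = [(-10.2, -12.5), (-8.0, -7.5), (-5.1, -9.3)]
toy_accuracy = minimal_pair_accuracy(pair_scores)  # 2 correct out of 3
```

Treating edges as unordered pairs is what makes the score "undirected": a probe that recovers the right attachment but not its direction still receives credit.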