Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, Yonatan Belinkov

TL;DR

This work examines how neural language models implement subject-verb agreement by applying causal mediation analysis to Transformer-based architectures. By treating neurons as mediators and intervening on the input, the study identifies two distinct agreement mechanisms in GPT-2 and Transformer-XL and a more unified mechanism in XLNet; larger models do not necessarily produce larger agreement margins. It further shows that the neurons most influential for agreement are shared across similar syntactic structures, and that natural indirect effect (NIE) patterns vary with structure and layer, which suggests that syntax representations are distributed and architecture-dependent. These findings advance interpretability by linking neuron-level mediators to syntactic behavior, and they have implications for model design and analysis across architectures.

Abstract

Targeted syntactic evaluations have demonstrated the ability of language models to perform subject-verb agreement given difficult contexts. To elucidate the mechanisms by which the models accomplish this behavior, this study applies causal mediation analysis to pre-trained neural language models. We investigate the magnitude of models' preferences for grammatical inflections, as well as whether neurons process subject-verb agreement similarly across sentences with different syntactic structures. We uncover similarities and differences across architectures and model sizes -- notably, that larger models do not necessarily learn stronger preferences. We also observe two distinct mechanisms for producing subject-verb agreement depending on the syntactic structure of the input sentence. Finally, we find that language models rely on similar sets of neurons when given sentences with similar syntactic structure.

Paper Structure

This paper contains 24 sections, 5 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Syntactic structures used in this study. Ungrammatical forms are marked with asterisks. Target subjects and their agreeing verb inflections are shown in blue, while attractors and their agreeing inflections are shown in red.
  • Figure 2: Total effects are measured by performing an intervention on the prompt (here, changing the grammatical number of the main subject), and measuring the relative change in the response variable (the ratio of probabilities of the originally incorrect verb form over the originally correct verb form).
  • Figure 3: Total effects for each structure by model size for GPT-2. Adverbial distractors increase total effects, while attractor phrases decrease them.
  • Figure 4: Grammaticality for each structure for GPT-2 Medium. The subject number (indicated by bar color) refers to the grammatical number of the subject with which the target verb agrees; the number in the structure name refers to the grammatical number of the attractor (in structures where attractors are present).
  • Figure 5: Indirect effects are measured by setting an individual neuron to the value it would have taken had the intervention occurred, then measuring the relative change in the response variable.
  • ...and 12 more figures
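
The total and indirect effects described in the captions of Figures 2 and 5 can be illustrated on a toy example. The sketch below is not the authors' code. It assumes a hypothetical two-neuron network with made-up weights in place of a real language model, and it encodes the subject's grammatical number as +1.0 (singular) or -1.0 (plural). All names (`hidden`, `verb_probs`, `response`, `total_effect`, `indirect_effect`) are illustrative. The response variable follows the Figure 2 caption: the probability of the originally incorrect verb form divided by the probability of the originally correct one.

```python
import math

# Hypothetical toy model (an assumption for illustration, not a real LM).
# Input x: +1.0 for a singular subject, -1.0 for a plural subject.
W_HIDDEN = [1.5, -0.5]           # input -> hidden neuron weights (made up)
W_OUT = [[2.0, -1.0],            # hidden -> logit of the singular verb form
         [-2.0, 1.0]]            # hidden -> logit of the plural verb form
W_DIRECT = [0.5, -0.5]           # input -> logits path that bypasses the neurons


def hidden(x):
    """Activations of the two hidden neurons for input x."""
    return [math.tanh(w * x) for w in W_HIDDEN]


def verb_probs(x, override=None):
    """Return (p_singular, p_plural).

    `override` maps a neuron index to a forced activation value, which is
    how an individual neuron is intervened on.
    """
    h = hidden(x)
    for idx, value in (override or {}).items():
        h[idx] = value
    logits = [
        sum(w * a for w, a in zip(row, h)) + d * x
        for row, d in zip(W_OUT, W_DIRECT)
    ]
    top = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return exps[0] / total, exps[1] / total


def response(x, override=None):
    """p(originally incorrect verb) / p(originally correct verb).

    The original prompt is assumed to have a singular subject, so the
    plural verb form is the originally incorrect one.
    """
    p_singular, p_plural = verb_probs(x, override)
    return p_plural / p_singular


def total_effect(x_original, x_intervened):
    """Relative change in the response when the prompt itself is changed."""
    base = response(x_original)
    return (response(x_intervened) - base) / base


def indirect_effect(x_original, x_intervened, neuron):
    """Relative change in the response when only one neuron is set to the
    value it would have taken under the intervention; the prompt is kept."""
    base = response(x_original)
    patched_value = hidden(x_intervened)[neuron]
    patched = response(x_original, override={neuron: patched_value})
    return (patched - base) / base


if __name__ == "__main__":
    print("total effect:", total_effect(1.0, -1.0))
    for i in range(len(W_HIDDEN)):
        print(f"indirect effect of neuron {i}:", indirect_effect(1.0, -1.0, i))
```

In this toy setting, a neuron with a larger indirect effect carries more of the subject-number information to the verb prediction. With a real model, the same comparison would be made by overriding activations inside the network rather than entries of a small list.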