CausalGym: Benchmarking causal interpretability methods on linguistic tasks

Aryaman Arora; Dan Jurafsky; Christopher Potts

TL;DR

CausalGym presents a scalable benchmark that repurposes SyntaxGym tasks to evaluate the causal efficacy of interpretability methods on linguistic behaviours in language models. By applying 1D interchange interventions across residual streams and comparing seven feature-finding methods (notably DAS), the work demonstrates that DAS most effectively induces the intended behavioural changes, though control tasks suggest that its advantage may arise from access to downstream outputs. The authors show that two difficult linguistic phenomena (negative polarity item licensing and filler–gap dependencies) emerge during training in discrete, multi-step stages rather than gradually, offering mechanistic insight into how LMs learn complex syntax. Overall, CausalGym provides a rigorous framework for causal evaluation of interpretability methods and encourages broader adoption of interventional approaches in computational psycholinguistics.

Abstract

Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behaviour. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M–6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler–gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.

Paper Structure (39 sections, 12 equations, 11 figures, 20 tables)

This paper contains 39 sections, 12 equations, 11 figures, 20 tables.

Introduction
Related work
Targeted syntactic evaluation.
Interventional interpretability.
Benchmark
Premise
Templatising SyntaxGym
Tasks
Evaluation
Methods
Preliminaries
Definitions
DAS.
Linear probe.
Diff-in-means.
...and 24 more sections

Figures (11)

Figure 1: The CausalGym pipeline: (1) take an input minimal pair ($\mathbf{b}, \mathbf{s}$) exhibiting a linguistic alternation that affects next-token predictions ($y_b, y_s$); (2) intervene on the base forward pass using a pre-defined intervention function that operates on aligned representations from both inputs; (3) check how this intervention affected the next-token prediction probabilities. In aggregate, such interventions assess the causal role of the intervened representation on the model's behaviour.
Figure 2: An example of the CausalGym conversion process on the test suite Subject-Verb Number Agreement (with prepositional phrase). The left side shows how items are structured in SyntaxGym originally, which we process into the templatic format on the right. The bottom shows how we sample a minimal pair.
Figure 3: Accuracy of pythia-family models on the CausalGym tasks, grouped by type, with scale. The dashed line is random-chance accuracy (50%).
Figure 4: Odds-ratio for checkpoints of pythia-1b on the task npi_any_subj_relc, plotted at every layer and template region. The $y$-axis is labelled with an example pair of sentences. The plot titles are labelled with the checkpoint and task accuracy. Darker regions indicate a token in a specific layer where causal effect was high.
Figure 5: Odds-ratio for each layer and region using DAS and probing on pythia-1b, on two tasks.
...and 6 more figures
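
The pipeline in Figure 1 and the odds-ratio plotted in Figures 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the toy probabilities are hypothetical, and the log odds-ratio shown is one plausible reading of the metric (how far the intervention moves probability mass from the base label $y_b$ to the source label $y_s$).

```python
import math


def interchange_1d(base, source, direction):
    """Replace the base vector's component along `direction` with the
    source vector's component, leaving all orthogonal components intact.
    (Hypothetical helper; the direction would come from a method such as
    DAS or a linear probe.)"""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    proj_base = sum(b * u for b, u in zip(base, unit))
    proj_source = sum(s * u for s, u in zip(source, unit))
    return [b + (proj_source - proj_base) * u for b, u in zip(base, unit)]


def log_odds_ratio(p_yb, p_ys, p_yb_int, p_ys_int):
    """Assumed form of the metric: log odds of y_b over y_s before the
    intervention, plus log odds of y_s over y_b after it. Zero means the
    intervention changed nothing; larger means a stronger causal effect."""
    return math.log(p_yb / p_ys) + math.log(p_ys_int / p_yb_int)


# Toy hidden states for a minimal pair (values are made up).
base_state = [1.0, 2.0]
source_state = [5.0, -3.0]
feature_direction = [2.0, 0.0]  # normalised inside interchange_1d

patched_state = interchange_1d(base_state, source_state, feature_direction)

# Toy next-token probabilities before and after the intervention.
no_effect = log_odds_ratio(0.8, 0.2, 0.8, 0.2)
full_flip = log_odds_ratio(0.8, 0.2, 0.2, 0.8)
```

In this toy case only the first coordinate of the base state changes, because the feature direction is axis-aligned; a learned direction would generally mix all coordinates. In a real setting the patched state would be written back into the model's residual stream at the chosen layer and token position before the forward pass continues.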
