Gumbel Counterfactual Generation From Language Models

Shauli Ravfogel, Anej Svete, Vésteinn Snæbjarnarson, Ryan Cotterell

TL;DR

The paper addresses the need to rigorously define and generate true counterfactuals for language models, distinguishing counterfactual reasoning from interventions within Pearl's causal hierarchy. It introduces a Gumbel-max-based well-founded SEM (WSEM) framework that represents language generation as a deterministic computation driven by exogenous noise, enabling joint sampling of original strings and their counterfactuals. Central to the method is the Hindsight Gumbel Sampling algorithm, which infers the latent noise from an observed sentence and reuses it to produce counterfactual continuations under a given intervention; the formalism ensures counterfactual stability under a Thurstone-inspired sampling process. Empirically, the authors show that standard interventions (e.g., MEMIT knowledge edits, linear steering, and instruction tuning) induce non-targeted side effects, challenging the goal of surgical, minimally invasive modifications. The work provides a principled pathway to analyze causal effects at the highest level of the causal hierarchy and motivates future exploration of counterfactually stable mechanisms to improve controllability and safety in LM generation.

Abstract

Understanding and manipulating the causal generation mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery (e.g., model ablations or manipulation of linear subspaces tied to specific concepts) to intervene on these models. To understand the impact of interventions precisely, it is useful to examine counterfactuals, e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl's causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as a structural equation model using the Gumbel-max trick, which we call Gumbel counterfactual generation. This reformulation allows us to model the joint distribution over original strings and their counterfactuals resulting from the same instantiation of the sampling noise. We develop an algorithm based on hindsight Gumbel sampling that allows us to infer the latent noise variables and generate counterfactuals of observed strings. Our experiments demonstrate that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
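
The noise-inference step described in the abstract can be illustrated for a single token. The sketch below is not the authors' implementation: it uses the standard top-down construction for sampling Gumbel noise conditioned on an observed argmax, and the names `hindsight_gumbel_noise`, `counterfactual_token`, `phi`, and `phi_cf` are hypothetical. The logits are made-up numbers, not outputs of a real model.

```python
import math
import random

def standard_gumbel(rng):
    """Draw one Gumbel(0, 1) variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # log(0) is undefined, so redraw in that rare case
        u = rng.random()
    return -math.log(-math.log(u))

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def hindsight_gumbel_noise(phi, observed, rng):
    """Sample noise U such that argmax_m (phi[m] + U[m]) == observed."""
    # The maximum perturbed logit is Gumbel(logsumexp(phi)) distributed,
    # independently of which index attains it.
    top = logsumexp(phi) + standard_gumbel(rng)
    noise = []
    for m, p in enumerate(phi):
        if m == observed:
            perturbed = top
        else:
            g = p + standard_gumbel(rng)
            # Gumbel(p) truncated to lie below `top`.
            perturbed = -logsumexp([-top, -g])
        noise.append(perturbed - p)
    return noise

def counterfactual_token(phi_cf, noise):
    """Reuse the inferred noise with (possibly intervened) logits."""
    scores = [p + u for p, u in zip(phi_cf, noise)]
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
phi = [2.0, 0.5, -1.0]      # hypothetical original logits
phi_cf = [-1.0, 0.5, 2.0]   # hypothetical post-intervention logits
noise = hindsight_gumbel_noise(phi, observed=0, rng=rng)
same = counterfactual_token(phi, noise)   # reproduces the observed token
cf = counterfactual_token(phi_cf, noise)  # counterfactual token
```

For a full string, the same step would be repeated autoregressively, one position at a time, with the logits at each position recomputed from the prefix generated so far.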

Paper Structure

This paper contains 29 sections, 6 theorems, 18 equations, 5 figures, 1 algorithm.

Key Result

Theorem 2.1

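Let ${\mathrm{X}}$ be a categorical RV over $\{1, \ldots, M\}$ such that $\mathbb{P}({\mathrm{X}} = m) = \frac{\exp(\phi_m)}{\sum_{m'=1}^{M} \exp(\phi_{m'})}$ for $m \in \{1, \ldots, M\}$ and a vector $\boldsymbol{\phi} \in \mathbb{R}^{M}$. (This, naturally, assumes that none of the probabilities are $0$, which is a common assumption both in language modeling and in decision theory; see Yellott, 1977; Cotterell et al., 2024.) Then, for $U_1, \ldots, U_M \stackrel{\text{i.i.d.}}{\sim} \mathrm{Gumbel}(0, 1)$, we have ${\mathrm{X}} \stackrel{d}{=} \operatorname*{argmax}_{m \in \{1, \ldots, M\}} \left(\phi_m + U_m\right)$, where $\stackrel{d}{=}$ refers to equality in distribution.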
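
As a quick empirical check of the Gumbel-max trick, the following minimal sketch perturbs a logit vector with i.i.d. standard Gumbel noise, takes the argmax, and compares the resulting frequencies with the softmax probabilities. The function names and the example vector `phi` are illustrative assumptions, not taken from the paper.

```python
import math
import random
from collections import Counter

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # log(0) is undefined, so redraw in that rare case
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max_sample(phi, rng):
    """Return argmax_m (phi[m] + U_m) with i.i.d. Gumbel(0, 1) noise U_m."""
    perturbed = [p + sample_gumbel(rng) for p in phi]
    return max(range(len(phi)), key=perturbed.__getitem__)

def softmax(phi):
    shift = max(phi)  # subtract the max for numerical stability
    exps = [math.exp(p - shift) for p in phi]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
phi = [1.0, 0.0, -1.0]  # hypothetical logits
n = 20000
counts = Counter(gumbel_max_sample(phi, rng) for _ in range(n))
empirical = [counts[m] / n for m in range(len(phi))]
expected = softmax(phi)
# `empirical` should be close to `expected`, up to sampling error.
```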

Figures (5)

  • Figure 1: A language process as a WSEM.
  • Figure 2: Normalized edit distance between the original and counterfactual sentences, for different intervention techniques. The horizontal lines denote the median of each distribution.
  • Figure 3: Counterfactual strings from the original model LLaMA3 and the counterfactual counterpart LLaMA3-Instruct.
  • Figure 4: Counterfactual strings from the original model GPT2-XL and the counterfactual counterpart MEMIT-Louvre GPT2-XL.
  • Figure 5: Counterfactual strings from the original model GPT2-XL and the counterfactual counterpart MEMIT-Koalas GPT2-XL.
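
The normalized edit distance reported in Figure 2 could be computed along the following lines. This is a sketch under assumptions: it uses Levenshtein distance divided by the length of the longer sequence, which may differ from the normalization and tokenization used in the paper. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

# Token-level comparison of an original and a counterfactual sentence.
original = "the cat sat on the mat".split()
counterfactual = "the dog sat on the mat".split()
score = normalized_edit_distance(original, counterfactual)
```

A value of 0 means the counterfactual is identical to the original, and a value of 1 means no token could be kept in place.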

Theorems & Definitions (17)

  • Definition 2.1: Acyclic SEM
  • Definition 2.2: Intervention
  • Definition 2.3: Counterfactual Distribution
  • Definition 2.4: Well-founded Order
  • Definition 2.5: Well-founded SEM
  • Definition 2.6: Gumbel distribution
  • Theorem 2.1
  • proof
  • Proposition 3.1: Hindsight Gumbel Sampling
  • Corollary 3.1: Counterfactual String Sampling
  • ...and 7 more