Counterfactual Token Generation in Large Language Models

Ivi Chatzi; Nina Corvelo Benz; Eleni Straitouri; Stratis Tsirtsis; Manuel Gomez-Rodriguez

TL;DR

State-of-the-art LLMs are stateless, so they cannot reason about counterfactual alternatives to tokens they have already generated. The paper addresses this gap by introducing a causal model of token generation built on the Gumbel-Max structural causal model (SCM). The model enables counterfactual token generation at almost no extra cost over vanilla generation, without fine-tuning or prompt engineering: after an intervention on part of the generated sequence, the remaining tokens are regenerated while reusing the same noise realization. The authors formalize the token-generation process as an SCM, describe an efficient sampling procedure, and evaluate the approach on two open-weight LLMs (Llama 3 8B-Instruct and Ministral-8B-Instruct) through qualitative story examples, a quantitative similarity analysis, and a bias-detection application. The work provides a practical tool for model introspection and fairness evaluation, with implications for safer human–AI interaction and for further study of the model of the world that LLMs construct.

Abstract

"Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom's Fury, gazing out at the endless sea. [...] Lyra's eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself." Although this story, generated by a large language model, is captivating, one may wonder -- how would the story have unfolded if the model had chosen "Captain Maeve" as the protagonist instead? We cannot know. State-of-the-art large language models are stateless -- they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation; it is embarrassingly simple to implement, and it requires neither fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-Instruct and Ministral-8B-Instruct and conduct a qualitative and a quantitative analysis of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
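The idea of reusing the sampler's noise can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_logits` is a hypothetical stand-in for an LLM's next-token logits, the seven-word vocabulary and the `forced` argument are invented for illustration, and only the Gumbel-Max rule itself (take the argmax of temperature-scaled logits plus Gumbel noise) is taken from the abstract's description.

```python
import zlib

import numpy as np

# Hypothetical toy vocabulary; a real LLM would use its tokenizer's vocabulary.
VOCAB = ["Lyra", "Maeve", "sailed", "stayed", "home", "away", "."]


def toy_logits(context):
    """Hypothetical stand-in for an LLM: logits are a fixed function of the context."""
    seed = zlib.crc32(" ".join(context).encode("utf-8"))
    return np.random.default_rng(seed).normal(size=len(VOCAB))


def generate(prompt, noise, forced=None, temperature=0.9):
    """Gumbel-Max decoding: token_i = argmax(logits_i / temperature + noise_i).

    `noise` holds one Gumbel(0, 1) vector per step (the sampler's state).
    `forced` maps a step index to a token that is imposed by intervention.
    """
    forced = forced or {}
    context, output = list(prompt), []
    for step, gumbel in enumerate(noise):
        if step in forced:
            token = forced[step]  # intervention: overwrite the sampled token
        else:
            scores = toy_logits(context) / temperature + gumbel
            token = VOCAB[int(np.argmax(scores))]
        context.append(token)
        output.append(token)
    return output


# Draw the noise once and keep it: this is what makes counterfactuals possible.
noise = np.random.default_rng(0).gumbel(size=(8, len(VOCAB)))
prompt = ["Tell", "me", "a", "story"]

factual = generate(prompt, noise)
# Counterfactual: force "Maeve" at step 2 and regenerate with the SAME noise.
counterfactual = generate(prompt, noise, forced={2: "Maeve"})

print("factual:       ", factual)
print("counterfactual:", counterfactual)
```

Because the noise is fixed, forcing the token that was actually generated reproduces the factual sequence exactly, and tokens before the intervention are unchanged. Resampling fresh noise after the intervention would instead give what the figure captions call interventional generation.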

Paper Structure (13 sections, 5 equations, 13 figures, 3 tables, 1 algorithm)

Introduction
A Causal Model of Token Generation
Counterfactual Token Generation Using Gumbel-Max SCMs
Experiments
How would the story have unfolded for "Captain Maeve"?
How similar is counterfactually generated text to the factual one?
Does counterfactual token generation reveal model biases?
Discussion and Limitations
Conclusions
Additional counterfactual stories
Additional details on the experimental setup of Section \ref{sec:bias}
Additional experimental results on bias detection
Additional experiments using a sampler that does not satisfy counterfactual stability

Figures (13)

Figure 1: Illustrative examples of autoregressive token generation. In all panels, plain text indicates the input provided to the LLM and highlighted text indicates the output generated by the model. Each token in the output sequence is highlighted in a different color to represent the (stochastic) state of the sampler. Panel (a) shows an LLM's output to a user's prompt using vanilla autoregressive token generation. Panels (b, c) show an LLM's output to an input comprising a user's prompt and an unmodified/modified part of the original output from Panel (a) using vanilla autoregressive token generation. Panel (d) shows an LLM's counterfactual output to an input comprising a user's prompt and a modified part of the output from Panel (a) using autoregressive token generation augmented with the Gumbel-Max SCM.
Figure 2: Causal graph of our proposed SCM $\mathcal{M}$ for token generation. Boxes represent endogenous random variables and circles represent exogenous random (noise) variables. The value of each endogenous variable is given by a function of the values of its ancestors in the causal graph, as defined by Eq. \ref{eq:SCM}. The value of each noise variable $U_i$ is sampled independently from a given distribution $P_U$, and it determines the stochastic state of the LLM's sampler during the generation of token $T_i$ (refer to Fig. \ref{fig:example}).
Figure 3: Examples of factual, interventional and counterfactual stories. Panel (a) shows a factual story, as given by Llama 3 8B-Instruct. Panels (b) and (c) show variants of the story resulting from interventional and counterfactual token generation, respectively. In panels (b), (c), we give as input to the LLM the original prompt along with the first sentence of the factual output (non-highlighted text), modified by replacing "Lyra" with "Maeve". Blue (green)-highlighted text indicates the tokens of the output that are identical in the factual story and its interventional (counterfactual) counterpart. Red-highlighted text indicates the differences. In both panels, the temperature parameter is set to $\tau=0.9$.
Figure 4: Comparison between interventional and counterfactual token generation. The panels show the edit distance between the factual token sequence and the sequence generated by interventional and counterfactual token generation using (a) the Gumbel-Max SCM defined in Equation \ref{eq:sampling-mechanism}, (b) the top-$p$ Gumbel-Max SCM, and (c) the top-$k$ Gumbel-Max SCM discussed at the end of Section \ref{sec:counterfactual}, against various values of the temperature parameter $\tau$, $p$ and $k$, respectively. In panels (b, c) the temperature parameter is set to $\tau = 0.6$. In all three panels, the edit distance is averaged over $4{,}000$ output sequences, resulting from two independent interventions per factual sequence, and shaded areas represent $95\%$ confidence intervals. Icons in the panels distinguish the results for Llama 3 8B-Instruct from those for Ministral-8B-Instruct.
Figure 5: Comparison between factual and counterfactual income. Panel (a) shows the change in income of male (female) individuals had they been female (male), while keeping fixed the rest of their attributes preceding income in the output sequence. Panel (b) shows the change in income of male (female) individuals had they been female (male), while keeping fixed the attributes preceding sex but allowing the attributes between sex and income to change in the output sequence. Panel (c) shows the factual distributions of income of female and male individuals and the counterfactual distribution of income of female individuals under the same intervention as in panel (b). Enlarged points in panels (a, b) and dashed lines in panel (c) correspond to the median income. In all panels, we use Llama 3 8B-Instruct and set the temperature parameter to $\tau=0.8$.
...and 8 more figures
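
The Figure 4 caption compares factual and regenerated sequences by edit distance. The sketch below shows one way such a distance could be computed over token lists. It assumes the standard Levenshtein distance (unit-cost insertions, deletions and substitutions); the captions do not state which variant the authors use, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (assumed unit costs)."""
    # prev[j] is the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, token_b in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,                         # delete token_a
                    curr[j - 1] + 1,                     # insert token_b
                    prev[j - 1] + (token_a != token_b),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


# Whitespace-split words stand in for tokens here; an LLM tokenizer would differ.
factual_tokens = "Captain Lyra stood at the helm".split()
counterfactual_tokens = "Captain Maeve stood at the helm".split()
print(edit_distance(factual_tokens, counterfactual_tokens))  # prints 1
```

A smaller distance means the regenerated text stays closer to the factual one, which is the quantity the caption reports as an average over many output sequences.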
