Executable Counterfactuals: Improving LLMs' Causal Reasoning Through Code

Aniket Vashishtha, Qirun Dai, Hongyuan Mei, Amit Sharma, Chenhao Tan, Hao Peng

TL;DR

The paper introduces executable counterfactuals, a code-based framework that enforces the full abduction-intervention-prediction cycle when evaluating LLMs' counterfactual reasoning. By mapping causal reasoning to executable Python templates and GSM-style math problems, the authors create controllable training and evaluation data with explicit latent variables and varying causal structures. Empirical results show a substantial drop from interventional to counterfactual performance for strong models. Standard supervised finetuning (SFT) generalizes poorly to out-of-distribution tasks, whereas reinforcement learning with verifiable rewards (RLVR) generalizes across domains and yields gains on both code and math reasoning. The findings favor reinforcement-learning-based approaches over supervised finetuning for instilling core counterfactual reasoning skills, and highlight the framework's potential for broader causal-skill evaluation and training in LLMs.

Abstract

Counterfactual reasoning, a hallmark of intelligence, consists of three steps: inferring latent variables from observations (abduction), constructing alternatives (intervention), and predicting their outcomes (prediction). This skill is essential for advancing LLMs' causal understanding and expanding their applications in high-stakes domains such as scientific research. However, existing efforts to assess LLMs' counterfactual reasoning capabilities tend to skip the abduction step, effectively reducing the task to interventional reasoning and leading to overestimation of LLM performance. To address this, we introduce executable counterfactuals, a novel framework that operationalizes causal reasoning through code and math problems. Our framework explicitly requires all three steps of counterfactual reasoning and enables scalable synthetic data creation with varying difficulty, creating a frontier for evaluating and improving LLMs' reasoning. Our results reveal a substantial drop in accuracy (25-40%) from interventional to counterfactual reasoning for SOTA models such as o4-mini and Claude-4-Sonnet. To address this gap, we construct a training set of counterfactual code problems with if-else conditions and test on out-of-domain code structures (e.g., while-loops); we also test whether a model trained on code generalizes to counterfactual math word problems. While supervised finetuning on stronger models' reasoning traces improves the in-domain performance of Qwen models, it decreases accuracy on OOD tasks such as counterfactual math problems. In contrast, reinforcement learning induces the core cognitive behaviors and generalizes to new domains, yielding gains over the base model on both code (an improvement of 1.5x-2x) and math problems. Analysis of the reasoning traces reinforces these findings and highlights the promise of RL for improving LLMs' counterfactual reasoning.
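
The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not one of the paper's templates: the function `f`, the helpers `abduct` and `counterfactual`, and the assumption that the latent input `r` is an integer in a small known range are all hypothetical, chosen only to show why a counterfactual question needs abduction while an interventional one does not.

```python
def f(x, r):
    """Toy program: x is the observed input, r is a hidden (latent) input."""
    if x + r > 10:
        return 2 * x + r
    return x - r


def abduct(x_obs, y_obs, candidates):
    """Abduction: keep every latent value consistent with the observed (x, y)."""
    return [r for r in candidates if f(x_obs, r) == y_obs]


def counterfactual(x_obs, y_obs, x_new, candidates):
    """Intervention + prediction: set x to x_new and rerun f with each
    latent value recovered by abduction (r itself is never given)."""
    return sorted({f(x_new, r) for r in abduct(x_obs, y_obs, candidates)})


# Assumed search space for the latent input (hypothetical).
latent_space = range(10)

# Observation: f(4, r) returned 15. What would f have returned had x been 2?
recovered = abduct(4, 15, latent_space)
answer = counterfactual(4, 15, 2, latent_space)
print(recovered, answer)

# Interventional version of the same question: r is handed over directly,
# so no abduction is required.
print(f(2, 7))
```

In this sketch the observation pins the latent input down to a single value, and changing `x` flips the branch taken, so the counterfactual answer cannot be obtained by reusing the observed branch. An interventional question supplies `r` up front and reduces to a plain forward execution.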

Paper Structure

This paper contains 43 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Executable counterfactuals across code, math, and a medical toy example to illustrate abduction-intervention-prediction. Code offers a controlled, executable setting that maps naturally to causal/computational graphs and transfers to natural-language tasks.
  • Figure 2: Template instance for generating if-else functions in the training set
  • Figure 3: Code function generated from the template in Figure 2
  • Figure 4: Another structurally different code function generated from the same template in Figure 2
  • Figure 6: Even for LLMs with strong general capabilities or thinking features, the performance gap between counterfactual and interventional questions derived from the same code function can still be large, showing the importance of targeted improvements in counterfactual reasoning.
  • ...and 2 more figures