Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma, Javier Gonzalez, Niranjani Prasad
TL;DR
This work addresses a gap in evaluating the causal reasoning of large language models by introducing double counterfactual consistency (DCC), a lightweight, inference-time procedure that tests whether a model can apply a causal intervention, correctly predict the counterfactual outcome, and then revert the intervention. DCC serves three roles: a metric, an inference-time rejection sampler, and a training-time reward, enabling evaluation and improvement of causal reasoning without labeled counterfactual data. Empirical results on GSM8K, CruxEval, and MATH show that DCC captures an aspect of reasoning distinct from standard accuracy and can improve performance on counterfactual tasks, though gains depend on model capacity and dataset characteristics. The approach offers a scalable, domain-agnostic way to disentangle factual accuracy from causal reasoning, with practical implications for improving reliability in decision-support and scientific tasks.
Abstract
Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, such data are difficult to produce at the scale required to cover the vast space of potential counterfactuals. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model's ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.
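To make the consistency idea concrete, the following is a minimal sketch of how a check of this kind could be structured. It is not the authors' implementation. It assumes that consistency means the answer to the question obtained after intervening and then reverting matches the answer to the original question. The names `dcc_check`, `dcc_score`, `answer_fn`, `intervene_fn`, and `revert_fn` are hypothetical, and plain Python functions on a toy addition task stand in for the LLM calls.

```python
from typing import Callable, Hashable, Iterable, Tuple

# Toy question format (an assumption for illustration): (a, b) means "what is a + b?"
Question = Tuple[int, int]


def dcc_check(
    question: Question,
    answer_fn: Callable[[Question], Hashable],
    intervene_fn: Callable[[Question], Question],
    revert_fn: Callable[[Question], Question],
) -> bool:
    """Return True if intervening and then reverting recovers the original answer."""
    factual = answer_fn(question)
    cf_question = intervene_fn(question)   # step 1: apply the intervention
    _cf_answer = answer_fn(cf_question)    # step 2: counterfactual prediction (no label to score it)
    restored = revert_fn(cf_question)      # step 3: revert the intervention
    return answer_fn(restored) == factual  # consistency needs no counterfactual labels


def dcc_score(
    questions: Iterable[Question],
    answer_fn: Callable[[Question], Hashable],
    intervene_fn: Callable[[Question], Question],
    revert_fn: Callable[[Question], Question],
) -> float:
    """Fraction of questions that pass the consistency check (0.0 if there are none)."""
    results = [dcc_check(q, answer_fn, intervene_fn, revert_fn) for q in questions]
    return sum(results) / len(results) if results else 0.0


# Hypothetical stand-ins for model behaviour on the toy task.
def add(q: Question) -> int:
    return q[0] + q[1]


def bump_first(q: Question) -> Question:
    return (q[0] + 1, q[1])  # intervention: increase the first operand by 1


def unbump_first(q: Question) -> Question:
    return (q[0] - 1, q[1])  # correct reversal of bump_first


def broken_revert(q: Question) -> Question:
    return q  # fails to undo the intervention


if __name__ == "__main__":
    data = [(2, 3), (4, 5)]
    print(dcc_score(data, add, bump_first, unbump_first))
    print(dcc_score(data, add, bump_first, broken_revert))
```

Under the same assumption, the boolean returned by `dcc_check` could serve as an accept/reject criterion for sampled responses at test time, and `dcc_score` as an aggregate metric. The abstract does not specify the exact protocol, so this sketch should be read only as an illustration of the general shape.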
