CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching
Yuzhe Wang, Yaochen Zhu, Jundong Li
TL;DR
The paper addresses a key gap in evaluating LLM causal judgment by showing that standard autoregressive training can rely on semantic correlations rather than true causal structure. It introduces CausalFlip, a benchmark built from semantically similar, label-flipped question pairs across confounder, chain, and collider structures, with pairwise train-test splits and Default/Alternative templates to penalize semantic shortcuts. It further proposes implicit causal reasoning, which progressively masks intermediate reasoning steps during training, and introduces a noisy-prefix evaluation to test robustness. Empirical results show that explicit CoT improves performance but remains vulnerable to semantic cues, whereas implicit causal reasoning yields stronger, more robust causal grounding under noise, suggesting a promising direction for eliciting latent causal reasoning in base LLMs.
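To make the idea of progressively masking intermediate reasoning steps concrete, the following is a minimal sketch of one possible curriculum. The summary above does not specify the schedule, so the linear schedule, the choice to drop leading steps first, and the names `masked_target`, `stage`, and `num_stages` are all illustrative assumptions rather than the authors' procedure.

```python
def masked_target(steps, answer, stage, num_stages):
    """Build the supervision target for one curriculum stage.

    Assumption (not from the paper): stage 0 keeps every reasoning step,
    stage == num_stages keeps only the answer, and the number of dropped
    leading steps grows linearly in between.
    """
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    if not 0 <= stage <= num_stages:
        raise ValueError("stage must be in [0, num_stages]")
    # Integer arithmetic avoids any rounding ambiguity.
    n_masked = (len(steps) * stage) // num_stages
    kept = list(steps[n_masked:])
    return " ".join(kept + [answer])


if __name__ == "__main__":
    steps = ["X causes Z.", "Z causes Y.", "So X affects Y through Z."]
    for stage in range(4):
        print(stage, "->", masked_target(steps, "Answer: yes", stage, 3))
```

Under this sketch, early stages resemble explicit CoT supervision and the final stage resembles answer-only supervision; the intermediate stages are where the reasoning is gradually removed from the visible target.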
Abstract
As large language models (LLMs) are increasingly deployed in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee that LLMs can truly reason causally, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the true underlying causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that can form different confounder, chain, and collider relations. For each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, so that models relying heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation in the reasoning process. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting that better eliciting the latent causal reasoning capabilities of base LLMs is a promising direction.
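As an illustration of the two evaluation ideas described above, the sketch below builds one label-flipped question pair over a chain-structured event triple and applies a noisy prefix to a list of reasoning steps. The question wording, the `CausalQuestion` type, and the helper names `chain_pair` and `add_noisy_prefix` are hypothetical and chosen only for this example; they are not the benchmark's actual templates or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalQuestion:
    structure: str  # "confounder", "chain", or "collider"
    text: str
    label: str      # "yes" or "no"


def chain_pair(x, z, y):
    """Return two questions over the chain X -> Z -> Y with opposite answers.

    Both questions reuse the same events and premise; only the direction
    asked about changes, so surface similarity is high while labels flip.
    The wording is an illustrative assumption.
    """
    premise = f"{x} causes {z}, and {z} causes {y}."
    q_yes = CausalQuestion(
        "chain", f"{premise} Does {x} have a causal effect on {y}?", "yes"
    )
    q_no = CausalQuestion(
        "chain", f"{premise} Does {y} have a causal effect on {x}?", "no"
    )
    return q_yes, q_no


def add_noisy_prefix(reasoning_steps, noise):
    """Prepend causally irrelevant text; the original steps are left unchanged."""
    return [noise] + list(reasoning_steps)


if __name__ == "__main__":
    q_yes, q_no = chain_pair("smoking", "tar buildup", "lung damage")
    print(q_yes.text, "->", q_yes.label)
    print(q_no.text, "->", q_no.label)
    steps = ["Smoking causes tar buildup.", "Tar buildup causes lung damage."]
    print(add_noisy_prefix(steps, "The weather was mild that year."))
```

A model that answers by matching surface patterns would tend to give both questions in a pair the same answer, and so would get at most one of them right; this is the property a pairwise split is meant to exploit.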
