CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching

Yuzhe Wang; Yaochen Zhu; Jundong Li

TL;DR

The paper addresses a key gap in evaluating LLM causal judgment by showing that standard autoregressive training can rely on semantic correlations rather than true causal structure. It introduces CausalFlip, a benchmark built from semantically similar, label-flipped question pairs across confounder, chain, and collider structures, with pairwise train-test splits and Default/Alternative templates to penalize semantic shortcuts. It further proposes implicit causal reasoning by progressively masking intermediate reasoning steps during training and introduces a noisy-prefix evaluation to test robustness. Empirical results show that while explicit-CoT improves performance, it remains vulnerable to semantic cues, whereas implicit causal reasoning yields stronger, more robust causal grounding under noise, suggesting a promising direction for eliciting latent causal reasoning in base LLMs.
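The progressive masking idea can be illustrated with a minimal sketch. It assumes that reasoning steps are hidden from the front and that the number of hidden steps grows linearly with the training stage. The function name `mask_reasoning_steps`, the `<mask>` placeholder, and the linear schedule are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def mask_reasoning_steps(
    steps: Sequence[str],
    stage: int,
    num_stages: int,
    mask_token: str = "<mask>",  # hypothetical placeholder token
) -> List[str]:
    """Hide a growing prefix of the reasoning steps as training progresses.

    At stage 0 every step is visible (plain explicit-CoT supervision); at
    stage == num_stages every step is hidden, so only the final answer
    remains supervised. The linear schedule is an assumption.
    """
    if num_stages < 1 or not 0 <= stage <= num_stages:
        raise ValueError("stage must satisfy 0 <= stage <= num_stages")
    # Number of leading steps to hide at this stage (integer arithmetic).
    k = (len(steps) * stage) // num_stages
    return [mask_token] * k + list(steps[k:])
```

Under these assumptions, a three-step rationale trained over three stages would expose three, two, one, and finally zero steps to the model.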

Abstract

As large language models (LLMs) witness increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability of LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the underlying true causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. For each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, so that models relying heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation in the reasoning process. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting that it is promising to better elicit the latent causal reasoning capabilities of base LLMs.
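The pair construction and noisy-prefix evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Question` type and the helper names are hypothetical, and it assumes that the pairwise split places one question of each label-flipped pair in the training set and its counterpart in the test set.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Question:
    text: str
    answer: str  # "yes" or "no"


def is_label_flipped(base: Question, opposite: Question) -> bool:
    """A valid pair has two different questions with opposite answers."""
    return base.text != opposite.text and base.answer != opposite.answer


def pairwise_split(
    pairs: Sequence[Tuple[Question, Question]],
) -> Tuple[List[Question], List[Question]]:
    """Assumed split: base questions go to train, opposites go to test.

    A model that only matches test questions to their semantically
    similar training neighbours would then copy the wrong label.
    """
    for base, opposite in pairs:
        if not is_label_flipped(base, opposite):
            raise ValueError("pair does not have flipped labels")
    train = [base for base, _ in pairs]
    test = [opposite for _, opposite in pairs]
    return train, test


def add_noisy_prefix(reasoning_steps: Sequence[str], noise: str) -> List[str]:
    """Prepend causally irrelevant text; the reasoning steps are unchanged."""
    return [noise] + list(reasoning_steps)
```

Because the noisy prefix leaves the reasoning steps and the gold answer untouched, any drop in accuracy after applying `add_noisy_prefix` can be attributed to sensitivity to surface text rather than to a change in the causal question.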

Paper Structure (33 sections, 11 equations, 3 figures, 3 tables)

Introduction
Preliminaries
LLM Basics
Chain-of-Thought Reasoning
Problem Formulation
Benchmark Design
Semantically Similar, Label-Flipped Pairs with Pairwise Train–Test Split
Causal Structures and Induced Questions
Confounder Dataset
Chain Dataset
Collider Dataset
Question Templates
Implicit Causal Reasoning
Motivation and Overview
Progressive Causal Reasoning Steps Mask
...and 18 more sections

Figures (3)

Figure 1: A representative example in which training samples create a spurious semantic correlation with a wrong answer, leading an LLM to an incorrect causal judgment.
Figure 2: Overview of the causal structures used in our benchmark and an example of base and opposite question pairs. The top row shows the causal structures of the three sub-datasets (confounder, chain, collider); the bottom row expands the confounder case and highlights base vs. opposite question pairs. In both the base and opposite question pairs, Q1 asks whether the causal relation from X to Y exists in the context of Z, with the fixed template "Will the increase of X cause Y during Z?", and Q2 asks whether the causal relations between Z and X / Y exist, with the fixed template "Will Z cause the increase of X and Y".
Figure 3: Accuracy of explicit-CoT versus implicit causal reasoning on CausalFlip across the three sub-datasets under clean inputs and noisy prefixes. Implicit causal reasoning consistently degrades less and performs better than explicit-CoT after the noisy prefix is injected, indicating reduced reliance on spurious semantic correlations.
