CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse

Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang

TL;DR

CausalT5K is a diagnostic benchmark for trustworthy causal reasoning in LLMs that embeds Pearl's Ladder of Causation into 5,147 validated cases across 10 domains. It combines a Sheep/Wolf trap taxonomy, adversarial pressure variants, and a Wise Refusal protocol to measure three capabilities: detection of associational bias, diagnosis and refusal under underdetermination, and counterfactual imagination (Rung 3). Results are analyzed with a two-axis Utility/Safety decomposition across a Four-Quadrant Control Landscape. The paper reports four key findings (the Skepticism Trap, inverse scaling of sycophancy, a persistent Detection-Correction Gap, and Rung Collapse) that reveal failure modes invisible to aggregate accuracy and show judge-model dynamics under audit pressure. By detailing a rigorous human-machine construction pipeline and validating the benchmark's ecological and structural integrity, CausalT5K provides a foundation for diagnostic benchmarks and adaptive remediation strategies to improve trustworthy causal reasoning in real-world AI systems.

Abstract

LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench
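
To illustrate why the abstract separates Utility (sensitivity) from Safety (specificity), the following is a minimal sketch, not the benchmark's actual scoring code. It assumes each case carries a hypothetical boolean label `claim_is_valid` (the causal claim is genuinely supported) and a hypothetical boolean `model_accepts` (the model endorsed the claim); the function name and the toy 80/20 split are likewise illustrative assumptions.

```python
def utility_safety(cases):
    """Compute (utility, safety, accuracy) from (claim_is_valid, model_accepts) pairs.

    Assumed mapping, using the standard definitions:
      utility = sensitivity = valid claims accepted / all valid claims
      safety  = specificity = invalid (trap) claims rejected / all invalid claims
    """
    tp = fn = tn = fp = 0
    for claim_is_valid, model_accepts in cases:
        if claim_is_valid:
            if model_accepts:
                tp += 1  # valid claim correctly accepted
            else:
                fn += 1  # valid claim wrongly rejected (over-skeptical)
        else:
            if model_accepts:
                fp += 1  # trap claim wrongly accepted
            else:
                tn += 1  # trap claim correctly rejected or refused
    total = tp + fn + tn + fp
    utility = tp / (tp + fn) if (tp + fn) else float("nan")
    safety = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / total if total else float("nan")
    return utility, safety, accuracy


# Toy data: a model that accepts every claim, on 80 valid and 20 trap cases.
always_accept = [(True, True)] * 80 + [(False, True)] * 20
utility, safety, accuracy = utility_safety(always_accept)
print(f"utility={utility:.2f} safety={safety:.2f} accuracy={accuracy:.2f}")
```

On this toy data the aggregate accuracy looks respectable while Safety is zero, which is the kind of failure a single accuracy number would hide.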

Paper Structure (107 sections, 5 figures, 20 tables)

This paper contains 107 sections, 5 figures, 20 tables.

Figures (5)

  • Figure 1: The Four-Quadrant Control Landscape, enabled by CausalT5K's pressure variants. Models are classified by Paranoia Rate ($y$-axis) and Sycophancy Ratio ($x$-axis). Static control policies fail: weak control permits sycophancy (Q4); excessive control induces paranoia (Q3). This classification is impossible without paired neutral/pressure cases.
  • Figure 2: Three causal structures underlying Tier 2 traps. Dashed arrows indicate spurious or incorrect relationships.
  • Figure 3: Counterfactual structures where simple associative reasoning fails. (a) Two sufficient causes make "but-for" tests fail. (b) Outcome depends on trajectory. (c) Answer depends on whether mediator $M$ is held fixed.
  • Figure 4: Contributor standing distribution.
  • Figure 5: Contributor major distribution using broad buckets.