CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse
Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang
TL;DR
CausalT5k introduces a diagnostic benchmark for trustworthy causal reasoning in LLMs by embedding Pearl's Ladder of Causation into 5,147 validated cases across 10 domains. It combines a Sheep/Wolf trap taxonomy, adversarial pressure variants, and a Wise Refusal protocol to measure three capabilities: detection of associational bias, diagnosis and refusal under underdetermination, and counterfactual imagination (Rung 3), all analyzed with a two-axis Utility/Safety decomposition across a Four-Quadrant Control Landscape. The paper reports four key findings—Skepticism Trap, inverse scaling of sycophancy, a persistent Detection-Correction Gap, and Rung Collapse—revealing failure modes invisible to aggregate accuracy and demonstrating judge-model dynamics under audit pressure. By detailing a rigorous human-machine construction pipeline and validating its ecological and structural integrity, CausalT5k provides a foundation for diagnostic benchmarks and adaptive remediation strategies to improve trustworthy causal reasoning in real-world AI systems.
Abstract
LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench
