Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships
Donggyu Lee, Sungwon Park, Yerin Hwang, Hyoshin Kim, Hyunwoo Oh, Jungwon Kim, Meeyoung Cha, Sangyoon Park, Jihee Kim
TL;DR
This work builds a benchmark for LLM causal reasoning grounded in real-world evidence, deriving $(X, d, Y)$ triplets from peer-reviewed economics and finance papers that use rigorous causal designs such as instrumental variables (IV), difference-in-differences (DiD), and regression discontinuity (RD). It generates 29,972 hard evaluation items across four task types that probe identification, spillover variation, multi-cause scenarios, and directional inference, validated by human annotation and JEL-topic mapping. Eight state-of-the-art LLMs show substantial gaps: the best model reaches only about 60.6% accuracy, and Type 1 causality judgments average around 41.0%, which points to a mismatch between model scale and genuine causal understanding. An open-book ablation shows that providing external domain knowledge can improve performance (notably on Type 1), but complex, multi-hop reasoning remains largely constrained by the model's intrinsic reasoning capabilities, underscoring the need for causality-focused mechanisms in high-stakes AI deployments.
Abstract
Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.
