Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Donggyu Lee; Sungwon Park; Yerin Hwang; Hyoshin Kim; Hyunwoo Oh; Jungwon Kim; Meeyoung Cha; Sangyoon Park; Jihee Kim

TL;DR

This work builds a benchmark for LLM causal reasoning that is grounded in real research findings, deriving $(X, d, Y)$ triplets from peer-reviewed economics and finance papers that use rigorous causal designs such as instrumental variables (IV), difference-in-differences (DiD), and regression discontinuity (RD). It generates 29,972 hard evaluation items across four task types that probe identification, spillover variation, multi-cause scenarios, and directional inference, validated by human annotation and JEL-topic mapping. Eight state-of-the-art LLMs show substantial gaps: the best model achieves only about 60.6% accuracy, and Type 1 causality judgments average around 41.0%, highlighting a mismatch between scaling and genuine causal understanding. An open-book ablation shows that providing external domain knowledge can improve performance (notably on Type 1), but complex, multi-hop reasoning remains largely constrained by the model's intrinsic reasoning capabilities, underscoring the need for causality-focused mechanisms in high-stakes AI deployments.
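As a rough illustration of how items built from $(X, d, Y)$ triplets could be scored per task type, the following minimal sketch may help. Everything in it is an assumption made for illustration: the class names, the field names, the reading of $d$ as a direction label such as "+" or "-", and the placeholder data are hypothetical and are not taken from the benchmark's actual schema or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalTriplet:
    """Hypothetical container for one (X, d, Y) triplet."""
    cause: str      # X
    direction: str  # d, assumed here to be a label such as "+" or "-"
    effect: str     # Y


@dataclass(frozen=True)
class EvalItem:
    """Hypothetical evaluation item derived from a triplet."""
    task_type: int          # e.g. 1 for "Type 1" items
    triplet: CausalTriplet
    gold: str               # expected answer for this item


def accuracy_by_task(items, predictions):
    """Return {task_type: fraction of items answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.task_type] += 1
        correct[item.task_type] += int(pred == item.gold)
    return {task: correct[task] / total[task] for task in total}


# Placeholder items; the variable names stand in for real findings.
items = [
    EvalItem(1, CausalTriplet("X1", "+", "Y1"), gold="+"),
    EvalItem(1, CausalTriplet("X2", "-", "Y2"), gold="-"),
    EvalItem(2, CausalTriplet("X3", "+", "Y3"), gold="+"),
]
predictions = ["+", "+", "+"]  # made-up model answers
scores = accuracy_by_task(items, predictions)
```

Reporting accuracy per task type rather than as a single pooled number mirrors how the summary above separates Type 1 results from the overall figure.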

Abstract

Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.
