Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Donggyu Lee, Sungwon Park, Yerin Hwang, Hyoshin Kim, Hyunwoo Oh, Jungwon Kim, Meeyoung Cha, Sangyoon Park, Jihee Kim

TL;DR

This work builds a benchmark for LLM causal reasoning grounded in real-world evidence, deriving $(X, d, Y)$ triplets from peer-reviewed economics and finance papers that use rigorous causal designs such as instrumental variables (IV), difference-in-differences (DiD), and regression discontinuity (RD). It generates 29,972 hard evaluation items across four task types that probe identification, spillover variation, multi-cause scenarios, and directional inference, validated by human annotation and JEL-topic mapping. Eight state-of-the-art LLMs show substantial gaps: the best model reaches only about 60.6% accuracy, and Type 1 causality judgments average around 41.0%, highlighting a mismatch between scaling and genuine causal understanding. An open-book ablation shows that providing external domain knowledge can improve performance (notably on Type 1), but complex, multi-hop reasoning remains largely constrained by the model's intrinsic reasoning capabilities, underscoring the need for causality-focused mechanisms in high-stakes AI deployments.

Abstract

Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.

Paper Structure

This paper contains 37 sections, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Overall Process of Causal Relation Extraction and Benchmark Generation
  • Figure 2: Five-year accuracy on Type 1 data (2000–2025) with 0–100 scale on the y-axis.