A Critical Review of Causal Reasoning Benchmarks for Large Language Models
Linying Yang, Vik Shirvaikar, Oscar Clivio, Fabian Falck
TL;DR
The paper critiques current causal-reasoning benchmarks for LLMs, arguing that many assess retrieval or superficial cues rather than genuine causal understanding. It organizes existing datasets and tasks according to causal hierarchies and the ladder of causation, and identifies widespread design flaws such as multiple-choice formats, data leakage, and mislabeling. The authors propose four desirable criteria for robust benchmarks (causal language, open-endedness, scalability, and non-retrievability) and advocate a CLUE-style framework to standardize evaluation across simple and complex causal reasoning tasks. By outlining concrete guidance and highlighting recent trends toward interventional and counterfactual reasoning, the work aims to advance reliable assessment of causal understanding in LLMs and to inform future benchmark design and research directions.
Abstract
Numerous benchmarks aim to evaluate the capabilities of Large Language Models (LLMs) for causal inference and reasoning. However, many of them can likely be solved through the retrieval of domain knowledge, which calls into question whether they achieve their purpose. In this review, we present a comprehensive overview of LLM benchmarks for causality. We highlight how recent benchmarks move towards a more thorough definition of causal reasoning by incorporating interventional or counterfactual reasoning. We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy. We hope this work will pave the way towards a general framework for the assessment of causal understanding in LLMs and the design of novel benchmarks.
