A Critical Review of Causal Reasoning Benchmarks for Large Language Models

Linying Yang, Vik Shirvaikar, Oscar Clivio, Fabian Falck

TL;DR

The paper critiques current causal-reasoning benchmarks for LLMs, arguing that many assess retrieval or superficial cues rather than genuine causal understanding. It synthesizes existing datasets and tasks through causal hierarchies and the ladder of causation, identifying widespread design flaws such as multiple-choice formats, data leakage, and mislabeling. The authors propose four desirable criteria for robust benchmarks—causal language, open-endedness, scalability, and non-retrievability—and advocate a CLUE-style framework to standardize evaluation across simple and complex causal reasoning tasks. By outlining concrete guidance and highlighting recent trends toward interventional and counterfactual reasoning, the work aims to advance reliable assessment of causal understanding in LLMs and to inform future benchmark design and research directions.

Abstract

Numerous benchmarks aim to evaluate the capabilities of Large Language Models (LLMs) for causal inference and reasoning. However, many of them can likely be solved through the retrieval of domain knowledge, questioning whether they achieve their purpose. In this review, we present a comprehensive overview of LLM benchmarks for causality. We highlight how recent benchmarks move towards a more thorough definition of causal reasoning by incorporating interventional or counterfactual reasoning. We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy. We hope this work will pave the way towards a general framework for the assessment of causal understanding in LLMs and the design of novel benchmarks.

Paper Structure (7 sections, 7 figures)

Figures (7)

  • Figure 1: An example of causal relation identification tasks.
  • Figure 2: Examples from com2sense (taken from Singh et al., 2021 and Srivastava et al., 2022).
  • Figure 3: An example from intuitive physics (Zečević et al., 2023).
  • Figure 4: An example from the CEG task.
  • Figure 5: Examples of incorrect ground-truth explanations in e-CARE.
  • ...and 2 more figures