CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models

Yuefei Chen, Vivek K. Singh, Jing Ma, Ruxiang Tang

TL;DR

CounterBench introduces a formal, causality-grounded benchmark of 1K counterfactual questions to evaluate LLMs beyond commonsense reasoning. The authors show that contemporary LLMs struggle with counterfactual inference, even with prompting strategies like CausalCoT. They propose CoIn, a two-phase reasoning framework with iterative search and backtracking, achieving substantial gains and demonstrating strong generalization on the CLADDER dataset. The work provides a rigorous benchmark and a robust reasoning paradigm that can enhance causal reasoning in LLMs, with potential impact on domains requiring robust counterfactual analysis.

Abstract

Counterfactual reasoning is widely recognized as one of the most challenging and intricate aspects of causality in artificial intelligence. In this paper, we evaluate the performance of large language models (LLMs) in counterfactual reasoning. In contrast to previous studies that primarily focus on commonsense causal reasoning, where LLMs often rely on prior knowledge for inference, we specifically assess their ability to perform counterfactual inference using a set of formal rules. To support this evaluation, we introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions. The dataset is designed with varying levels of difficulty, diverse causal graph structures, distinct types of counterfactual questions, and multiple nonsensical name variants. Our experiments demonstrate that counterfactual reasoning poses a significant challenge for LLMs, with most models performing at levels comparable to random guessing. To enhance LLMs' counterfactual reasoning ability, we propose a novel reasoning paradigm, CoIn, which guides LLMs through iterative reasoning and backtracking to systematically explore counterfactual solutions. Experimental results show that our method significantly improves LLM performance on counterfactual reasoning tasks and consistently enhances performance across different LLMs. Our dataset is available at https://huggingface.co/datasets/CounterBench/CounterBench.
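To make "counterfactual inference using a set of formal rules" concrete, the following is a minimal sketch of how a deterministic counterfactual query over a small causal graph can be answered by intervening on a variable and propagating the change. It is an illustration only, not the paper's dataset format or the CoIn method: the graph, the nonsense variable names (Blorp, Zumf, Vork, Kelp), the `evaluate` helper, and the boolean mechanisms are all assumptions made up for this example. It covers only single-variable and multi-variable interventions, loosely corresponding to the basic and joint question types.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical representation: each non-root variable maps to
# (parent names, boolean structural function). Entries must be
# listed in topological order (parents before children).
Mechanism = Tuple[List[str], Callable[..., bool]]


def evaluate(
    model: Dict[str, Mechanism],
    roots: Dict[str, bool],
    interventions: Optional[Dict[str, bool]] = None,
) -> Dict[str, bool]:
    """Propagate values through the graph, overriding intervened variables."""
    interventions = interventions or {}
    # Interventions take precedence over the observed root values.
    values = {**roots, **interventions}
    for name, (parents, fn) in model.items():
        # An intervened variable ignores its parents (the do-operation).
        if name not in interventions:
            values[name] = fn(*(values[p] for p in parents))
    return values


# Made-up graph: Blorp -> Zumf -> Kelp <- Vork
model: Dict[str, Mechanism] = {
    "Zumf": (["Blorp"], lambda blorp: blorp),
    "Kelp": (["Zumf", "Vork"], lambda zumf, vork: zumf or vork),
}
observed_roots = {"Blorp": True, "Vork": False}

# Factual world, then "what if not Blorp?" and "what if not Blorp and Vork?"
factual = evaluate(model, observed_roots)
basic = evaluate(model, observed_roots, {"Blorp": False})
joint = evaluate(model, observed_roots, {"Blorp": False, "Vork": True})

print(factual["Kelp"], basic["Kelp"], joint["Kelp"])  # True False True
```

Because the variable names carry no real-world meaning, the answer can only be obtained by following the stated causal rules, which is the kind of reasoning the benchmark is described as targeting.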

Paper Structure

This paper contains 33 sections, 2 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison of accuracy scores on the CounterBench dataset across different strategies: our proposed CoIn paradigm versus baseline approaches (Standard, CausalCoT jin2023cladder, and Solver hua2024improving), evaluated using GPT-4o. Our CounterBench dataset includes four question types: Basic explores the effect of a single change in a causal variable, Joint involves simultaneous changes in multiple causes, Nested involves stepwise hypothetical assumptions about multiple variables, and Conditional evaluates counterfactuals under observed conditions.
  • Figure 2: Illustration of our framework. We create CounterBench, a dataset featuring four types of counterfactual questions (basic, joint, conditional, and nested). Based on this dataset, we conduct benchmark experiments on state-of-the-art methods, analyze their errors, and develop our enhanced reasoning framework that incorporates dynamic evaluation and intelligent backtracking capabilities.
  • Figure 3: Error Analysis of CausalCoT.
  • Figure 4: Accuracy comparison between Standard, CoIn, and CausalCoT on the Anticommonsense and Commonsense datasets.
  • Figure 5: The prompt design of CoIn.
  • ...and 9 more figures