CausalBench: A Comprehensive Benchmark for Causal Learning Capability of LLMs

Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, Kay Chen Tan

TL;DR

CausalBench presents a thorough, multi-faceted benchmark to quantify causal learning in LLMs across datasets (2–109 nodes), tasks (correlation, skeleton, causality, and CoT-analogous reasoning), and prompts (including training data and background knowledge). Across 19 LLMs, closed-source models outperform open-source ones but still lag behind traditional causal-learning methods, especially on larger networks, with chain structures being easier to identify than colliders. The study reveals that LLMs benefit from long-text prompts and prior knowledge but rely primarily on semantic associations rather than distributional cues, and that CoT-style prompting substantially boosts causal reasoning in many cases. The work provides a practical framework for evaluating and guiding future improvements in LLM-based causal learning and suggests directions such as prompt enrichment, data-integration strategies, and targeted fine-tuning to handle larger, more complex causal graphs.

Abstract

The ability to understand causality significantly impacts the competence of large language models (LLMs) in output explanation and counterfactual reasoning, as causality reveals the underlying data distribution. However, the lack of a comprehensive benchmark currently limits the evaluation of LLMs' causal learning capabilities. To fill this gap, this paper develops CausalBench based on data from the causal research community, enabling comparative evaluations of LLMs against traditional causal learning algorithms. To provide a comprehensive investigation, we offer three tasks of varying difficulty: correlation, causal skeleton, and causality identification. Evaluations of 19 leading LLMs reveal that, while closed-source LLMs show potential for simple causal relationships, they significantly lag behind traditional algorithms on larger-scale networks ($>50$ nodes). Specifically, LLMs struggle with collider structures but excel at chain structures, especially at long-chain causality analogous to Chain-of-Thought techniques. This supports current prompting approaches while suggesting directions to enhance LLMs' causal reasoning capability. Furthermore, CausalBench incorporates background knowledge and training data into prompts to thoroughly unlock LLMs' text-comprehension ability during evaluation. The findings indicate that LLMs understand causality through semantic associations with distinct entities, rather than directly from contextual information or numerical distributions.
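The distinction between the causal skeleton and causality tasks, and between chain and collider structures, can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the three-variable graph, the helper names (`skeleton`, `chains`, `colliders`, `f1_score`), and the choice to score edge sets with F1 are assumptions made for this example.

```python
from itertools import combinations

def skeleton(edges):
    """Drop edge directions: each directed edge becomes an unordered pair."""
    return {frozenset(edge) for edge in edges}

def chains(edges):
    """Return triples (a, b, c) such that a -> b -> c."""
    edge_set = set(edges)
    return {(a, b, c)
            for a, b in edge_set
            for b2, c in edge_set
            if b == b2 and c != a}

def colliders(edges):
    """Return triples (a, c, b) such that a -> c <- b."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    return {(a, child, b)
            for child, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)}

def f1_score(predicted, truth):
    """F1 of a predicted edge set against a ground-truth edge set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy graph (an assumption for illustration, not benchmark data).
truth = {("gene", "smoke"), ("smoke", "cancer"), ("gene", "cancer")}
# A hypothetical model answer with one edge reversed.
predicted = {("gene", "smoke"), ("cancer", "smoke"), ("gene", "cancer")}

print(chains(truth))     # chain: gene -> smoke -> cancer
print(colliders(truth))  # collider: gene -> cancer <- smoke
print(f1_score(skeleton(predicted), skeleton(truth)))  # skeleton task
print(f1_score(predicted, truth))                      # causality task
```

In this toy case the reversed edge leaves the skeleton score untouched but lowers the causality score, which is one way to see why identifying direction is the harder task.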

Paper Structure (55 sections, 4 equations, 32 figures, 21 tables)

This paper contains 55 sections, 4 equations, 32 figures, 21 tables.

Figures (32)

  • Figure 1: Illustration of the overall evaluation process on CausalBench. The general evaluation framework is similar to existing evaluations, and the key differences between CausalBench and previous evaluations lie in each sub-process, where CausalBench possesses more standardized and comprehensive evaluation approaches. For instance, CausalBench features diverse prompt formats and evaluation tasks compared to the existing evaluations.
  • Figure 2: Illustration of CausalBench. CausalBench has four advantages: diverse datasets from the causal learning community, three evaluation tasks of varying depth and difficulty, four prompt formats with rich information, and the demonstration of the upper limit of LLM capabilities across various scales and complexities. Specifically, in (a), CausalBench offers three tasks of different difficulties, i.e., correlation, causal skeleton identification, and causality, to holistically assess the causal learning capabilities of existing LLMs (e.g., gene, smoke and cancer). In (b), the variable name is the most prevalent prompt format in existing works; background knowledge is the domain knowledge for each variable in its field, sourced from Wikipedia and other encyclopedic websites; and training data refers to a matrix where columns denote nodes and rows denote observed samples (i.e., cases). In this paper, training data consists of 500 observed samples for each variable. In CausalBench, four prompt formats are designed: prompt 1 (variable name), prompt 2 (variable name + background knowledge), prompt 3 (variable name + training data), and prompt 4 (variable name + background knowledge + training data). In (c), CausalBench covers causal learning tasks of various scales, ranging from 5 to 109 nodes, evaluates various types of causal structures, and discusses different densities in causal learning networks.
  • Figure 3: Construction process of CausalBench and comparison to existing evaluations. (a) Existing evaluations are confined to causal learning datasets with no more than 20 nodes, concentrate on causality assessment, and use variable names directly as the input format to LLMs. (b) In its data component, CausalBench integrates structured data, background knowledge, and a diverse set of ground truths, ranging from 2 to 109 nodes, from the causal learning community; in its task component, it defines three principal evaluation tasks: identifying correlation, causal skeleton, and causality; in its prompt component, it introduces three new prompt formats: variable names coupled with structured data, variable names with background knowledge, and a combination of both. (c) The advantages of CausalBench include better exploitation of LLMs' prior knowledge and long-text comprehension, understanding causal relationships of varying depths and difficulties, and demonstrating the upper limits of LLM abilities across various scales and complexities.
  • Figure 4: F1 scores of LLMs on direct correlation identification.
  • Figure 5: F1 scores of LLMs on indirect correlation identification.
  • ...and 27 more figures
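
The four prompt formats described in the Figure 2 caption (variable names alone, plus background knowledge, plus training data, or plus both) can be sketched as a single builder with two optional inputs. This is a minimal sketch under stated assumptions, not the paper's actual prompt templates: the function name `build_prompt`, the section labels, the comma-separated serialization of samples, and the closing question are all hypothetical.

```python
def build_prompt(variables, background=None, samples=None):
    """Assemble a prompt from variable names plus optional extras.

    variables:  list of variable names (used by all four formats)
    background: optional dict mapping a variable name to a description
    samples:    optional list of rows, one value per variable in each row
    """
    parts = ["Variables: " + ", ".join(variables)]
    if background:
        parts.append("Background knowledge:")
        parts.extend(f"- {name}: {background[name]}"
                     for name in variables if name in background)
    if samples:
        parts.append("Training data (one row per observed sample):")
        parts.append(",".join(variables))
        parts.extend(",".join(str(value) for value in row) for row in samples)
    # Hypothetical question wording, chosen only for this illustration.
    parts.append("Question: which pairs of variables are directly causally related?")
    return "\n".join(parts)

# Hypothetical inputs, not taken from the benchmark.
names = ["gene", "smoke", "cancer"]
notes = {"smoke": "Whether the person smokes."}
rows = [[0, 1, 1], [1, 0, 0]]

prompt_1 = build_prompt(names)                                   # variable name
prompt_2 = build_prompt(names, background=notes)                 # + background knowledge
prompt_3 = build_prompt(names, samples=rows)                     # + training data
prompt_4 = build_prompt(names, background=notes, samples=rows)   # + both
print(prompt_4)
```

Treating background knowledge and training data as independent optional inputs keeps the four formats comparable, since the variable list and the question stay identical across them.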