CausalBench: A Comprehensive Benchmark for Causal Learning Capability of LLMs
Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, Kay Chen Tan
TL;DR
CausalBench presents a thorough, multi-faceted benchmark to quantify causal learning in LLMs across datasets (2–109 nodes), tasks (correlation, skeleton, causality, and CoT-analogous reasoning), and prompts (including training data and background knowledge). Across 19 LLMs, closed-source models outperform open-source ones but still lag behind traditional causal-learning methods, especially on larger networks, with chain structures being easier to identify than colliders. The study reveals that LLMs benefit from long-text prompts and prior knowledge but rely primarily on semantic associations rather than distributional cues, and that CoT-style prompting substantially boosts causal reasoning in many cases. The work provides a practical framework for evaluating and guiding future improvements in LLM-based causal learning and suggests directions such as prompt enrichment, data-integration strategies, and targeted fine-tuning to handle larger, more complex causal graphs.
Abstract
The ability to understand causality significantly impacts the competence of large language models (LLMs) in output explanation and counterfactual reasoning, as causality reveals the underlying data distribution. However, the lack of a comprehensive benchmark currently limits the evaluation of LLMs' causal learning capabilities. To fill this gap, this paper develops CausalBench based on data from the causal research community, enabling comparative evaluations of LLMs against traditional causal learning algorithms. To provide a comprehensive investigation, we offer three tasks of varying difficulty: correlation, causal skeleton, and causality identification. Evaluations of 19 leading LLMs reveal that, while closed-source LLMs show potential for simple causal relationships, they significantly lag behind traditional algorithms on larger-scale networks ($>50$ nodes). Specifically, LLMs struggle with collider structures but excel at chain structures, especially long-chain causality analogous to Chain-of-Thought techniques. This supports current prompting approaches while suggesting directions to enhance LLMs' causal reasoning capability. Furthermore, CausalBench incorporates background knowledge and training data into prompts to thoroughly unlock LLMs' text-comprehension ability during evaluation. The findings indicate that LLMs understand causality through semantic associations with distinct entities, rather than directly from contextual information or numerical distributions.
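To make the difference between the skeleton and causality identification tasks, and between chain and collider structures, concrete, the following is a minimal sketch on a toy graph. Everything in it is an assumption for illustration: the four-node graph, the "predicted" edges standing in for an LLM's answer, the helper names (`skeleton`, `f1`, `chains_and_colliders`), and the choice of F1 as the score. This section does not state which metrics CausalBench actually uses.

```python
from itertools import combinations

# Hypothetical ground-truth DAG: each pair is (cause, effect).
TRUE_EDGES = {("A", "B"), ("B", "C"), ("D", "C")}

# Hypothetical model answer: adjacencies are right, one direction is flipped.
PRED_EDGES = {("A", "B"), ("C", "B"), ("D", "C")}


def skeleton(edges):
    """Drop edge directions: the target of the skeleton task."""
    return {frozenset(edge) for edge in edges}


def f1(predicted, truth):
    """F1 score between a predicted and a true edge set (assumed metric)."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def chains_and_colliders(edges):
    """List chains X -> Y -> Z and colliders X -> Y <- Z in a directed graph."""
    chains = {
        (x, y, z)
        for (x, y) in edges
        for (y2, z) in edges
        if y == y2 and x != z
    }
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
    colliders = {
        (x, node, z)
        for node, causes in parents.items()
        for x, z in combinations(sorted(causes), 2)
    }
    return chains, colliders


skeleton_score = f1(skeleton(PRED_EDGES), skeleton(TRUE_EDGES))
causality_score = f1(PRED_EDGES, TRUE_EDGES)
chains, colliders = chains_and_colliders(TRUE_EDGES)

print("skeleton F1:", skeleton_score)
print("causality F1:", round(causality_score, 3))
print("chains:", chains)
print("colliders:", colliders)
```

In this toy case the predicted skeleton matches the true one exactly, while the directed score drops because the edge into the collider at `C` is reversed. That is the kind of gap between adjacency and orientation that the three tasks are meant to separate.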
