CLEAR: Can Language Models Really Understand Causal Graphs?
Sirui Chen, Mengying Xu, Kun Wang, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Chaochao Lu
TL;DR
This work asks whether language models truly understand causal graphs. It proposes a practical, multidisciplinary framework for defining understanding and introduces CLEAR, a benchmark with three complexity levels and 20 causal tasks. The framework defines four concrete behavioral criteria for assessing understanding, and the study evaluates six leading models using four prompting strategies. It reports that while the models show a preliminary understanding, substantial gaps remain, particularly in robustness to question types and in leveraging explicit definitions. Key findings include uneven performance across graph-based tasks, strong competence on basic tasks, sensitivity to question form, partial benefits from definition guidance, and limited transfer across dependent tasks, with counterfactual analysis linking a focus on causally relevant information to correct reasoning. The study provides a public benchmark and a structured lens for evaluating and improving causal reasoning in language systems, offering practical insights for the future development of reliable causal-graph reasoning in large-scale models ($P > P_r$ for meaningful understanding).
Abstract
Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR.
