CLEAR: Can Language Models Really Understand Causal Graphs?

Sirui Chen, Mengying Xu, Kun Wang, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Chaochao Lu

TL;DR

This work asks whether language models truly understand causal graphs. It proposes a practical, multidisciplinary framework for defining understanding and introduces CLEAR, a benchmark with three complexity levels and 20 causal graph-based tasks. It defines four concrete behavioral criteria for assessing understanding and evaluates six leading models under four prompting strategies. The models show a preliminary understanding, but substantial gaps remain, particularly in robustness to question types and in using explicit definitions. Key findings include uneven performance across graph-based tasks, strong competence on basic tasks, sensitivity to question form, partial benefits from definition guidance, and limited transfer across dependent tasks; a counterfactual analysis links a focus on causally relevant information to correct reasoning. The study provides a public benchmark and a structured lens for evaluating and improving causal reasoning in language systems, offering practical insights for developing reliable causal-graph reasoning in large-scale models ($P > P_r$ for meaningful understanding).

Abstract

Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR.

Paper Structure (40 sections, 12 figures, 5 tables)

This paper contains 40 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Performance of six leading language models across 20 diverse tasks in CLEAR. Further details on the experimental results can be found in the experiments section.
  • Figure 2: Hierarchy and dependent relationships of tasks in CLEAR. We define three complexity levels. (1) Level 1: Basic Task. Mastering these concepts is a prerequisite for understanding any general graph. (2) Level 2: Intermediate Task. These tasks represent the most common characteristics in causal graphs. Causal graph-based reasoning relies heavily on understanding these fundamental problems. (3) Level 3: Advanced Task. These tasks present complex, high-level challenges that are central to causal graph understanding. Solid arrows indicate the dependencies between tasks within the same level, while dashed arrows represent the tasks' dependencies across different levels. Task dependency design draws on established research (shpitser2006identification, pearl2009causality, bareinboim2012causal, pearl2016causal, pearl2018book, jaber2019causal).
  • Figure 3: Six question types. Taking the backdoor path as an example, we design six question types in CLEAR. A complete question is formulated by combining the causal graph info with a specific question type.
  • Figure 4: Overall model performance. Each cell corresponds to the model's accuracy on that specific task.
  • Figure 5: Model performance across the three levels of CLEAR. The term Mixtral refers to Mixtral-8$\times$7B, Llama2 to Llama2-Chat-70B, and InternLM2 to InternLM2-Math-20B.
  • ...and 7 more figures
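
The Figure 3 caption says a complete question combines the causal graph info with a question type, using the backdoor path as its example. The following is a minimal Python sketch of that idea, not the benchmark's actual implementation. The graph description wording, the two question templates, and all function names are hypothetical. A backdoor path is taken here in the standard sense: a path between treatment and outcome whose first edge points into the treatment.

```python
from collections import defaultdict

# Hypothetical question templates; CLEAR's real wording is not shown in this text.
QUESTION_TEMPLATES = {
    "yes_no": "Is there a backdoor path from {x} to {y}?",
    "find_all": "List all backdoor paths from {x} to {y}.",
}


def describe_graph(edges):
    """Render directed edges (cause, effect) as plain-text graph info."""
    return " ".join(f"{cause} causes {effect}." for cause, effect in edges)


def build_question(edges, question_type, x, y):
    """Combine the graph description with one question template."""
    template = QUESTION_TEMPLATES[question_type]
    return f"{describe_graph(edges)} {template.format(x=x, y=y)}"


def backdoor_paths(edges, treatment, outcome):
    """Enumerate simple paths from treatment to outcome that start
    with an edge pointing into the treatment (parent -> treatment)."""
    parents = defaultdict(set)
    neighbors = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)
        neighbors[cause].add(effect)   # skeleton is undirected:
        neighbors[effect].add(cause)   # store both directions

    paths = []

    def extend(path):
        node = path[-1]
        if node == outcome:
            paths.append(list(path))
            return
        for nxt in sorted(neighbors[node]):
            if nxt not in path:        # keep the path simple
                path.append(nxt)
                extend(path)
                path.pop()

    # The first step must go to a parent of the treatment.
    for parent in sorted(parents[treatment]):
        extend([treatment, parent])
    return paths


# Toy confounded graph: Z -> X, Z -> Y, X -> Y.
edges = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
question = build_question(edges, "yes_no", "X", "Y")
reference = backdoor_paths(edges, "X", "Y")
```

On this toy graph the only backdoor path from X to Y runs through Z; the direct edge X -> Y is excluded because it leaves X rather than entering it. A reference answer computed this way could be compared with a model's reply to obtain per-task accuracy.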