On the Reliability of Large Language Models for Causal Discovery

Tao Feng; Lizhen Qu; Niket Tandon; Zhuang Li; Xiaoxi Kang; Gholamreza Haffari

TL;DR

This study evaluates how open-source large language models perform causal discovery, focusing on memorization, generalization, and contextual effects. Using openly accessible pre-training corpora (Dolma and ROOTS) together with real-world and synthetic datasets, the authors show that LLMs memorize frequent causal relations but struggle to generalize to novel or rare ones. They also show that incorrect causal relations in the pre-training data can erode confidence in the corresponding correct predictions, and that contextual information substantially changes outcomes. The work suggests combining traditional statistical causal discovery with LLM-based approaches and highlights the need to curate pre-training data to minimize conflicting information, with implications for reliable causal reasoning in AI systems.

Abstract

This study investigates the efficacy of Large Language Models (LLMs) in causal discovery. Using newly available open-source LLMs, OLMo and BLOOM, which provide access to their pre-training corpora, we investigate how LLMs address causal discovery through three research questions. We examine: (i) the impact of memorization on accurate causal relation prediction, (ii) the influence of incorrect causal relations in pre-training data, and (iii) the contextual nuances that influence LLMs' understanding of causal relations. Our findings indicate that while LLMs are effective at recognizing causal relations that occur frequently in pre-training data, their ability to generalize to new or rare causal relations is limited. Moreover, the presence of incorrect causal relations significantly undermines the confidence of LLMs in the corresponding correct causal relations, and contextual information critically affects how well LLMs discern causal connections between random variables.

Paper Structure (40 sections, 12 figures, 18 tables)

Introduction
Background
Methodology
RQ1.
RQ2.
RQ3.
Experimental Setup
Datasets
Tasks.
Real-World Data
Full Causal Discovery.
Causal Direction Identification.
Synthetic Data
Causal Direction Identification.
Models
...and 25 more sections

Figures (12)

Figure 1: The average F1 score and accuracy of OLMo-7b-Instruct by occurrence interval on the full causal discovery task, averaged across 0 to 4 ICL examples. The occurrence data in (a) and (b) are derived from the exact matching query, while the occurrence data in (c) and (d) are derived from the "event A" $\Rightarrow$ "causes" $\Rightarrow$ "event B" query. An asterisk (*) indicates a p-value < 0.05 for the Pearson and Spearman correlation coefficients [freedman2007statistics].
Figure 2: The average F1 score and accuracy of BLOOM-7b1 by occurrence interval on the full causal discovery task, averaged across 0 to 4 ICL examples. The occurrence data are derived from the exact matching query.
Figure 3: The average F1 score and accuracy of OLMo-7b-Instruct by occurrence interval on causal direction identification task, averaged across 0 to 4 ICL examples. The occurrence data are derived from the exact matching query in the Dolma pre-training corpus.
Figure 4: The average F1 score and accuracy of OLMo-7b-Instruct by occurrence interval on causal direction identification task, averaged across 0 to 4 ICL examples. The occurrence data are derived from the "event A" $\Rightarrow$ "causes" $\Rightarrow$ "event B" query in the Dolma pre-training corpus.
Figure 5: The average F1 score and accuracy of BLOOM-7b1 by occurrence interval on causal direction identification task, averaged across 0 to 4 ICL examples. The occurrence data are derived from the exact matching query in the ROOTS pre-training corpus.
...and 7 more figures
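
The captions above report F1 and accuracy for predicted causal relations. As a rough illustration of how such scores can be computed for a predicted causal graph, the sketch below compares a set of predicted directed edges with a reference set. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `edge_scores`, the toy variables, and the choice to compute accuracy over all ordered pairs of distinct variables are all assumptions made for illustration, and the paper's exact scoring protocol may differ.

```python
from itertools import permutations


def edge_scores(variables, gold_edges, predicted_edges):
    """Score predicted directed edges (cause, effect) against a reference graph.

    Hypothetical helper for illustration; not taken from the paper.
    """
    gold = set(gold_edges)
    pred = set(predicted_edges)

    # An edge counts as a true positive only if its direction also matches.
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Assumption: accuracy treats every ordered pair of distinct variables
    # as one binary decision ("does A cause B?").
    pairs = list(permutations(variables, 2))
    correct = sum((pair in gold) == (pair in pred) for pair in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy example with made-up variables; these are not data from the paper.
variables = ["smoking", "tar deposits", "lung cancer"]
gold = [("smoking", "tar deposits"), ("tar deposits", "lung cancer")]
pred = [("smoking", "tar deposits"), ("lung cancer", "tar deposits")]

scores = edge_scores(variables, gold, pred)
print(scores)
```

In this toy case the second predicted edge has the wrong direction, so it counts as both a false positive and a missed reference edge. That is why direction errors are penalized twice under this particular scoring choice.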
