On the Reliability of Large Language Models for Causal Discovery
Tao Feng, Lizhen Qu, Niket Tandon, Zhuang Li, Xiaoxi Kang, Gholamreza Haffari
TL;DR
This study evaluates how open-source large language models perform causal discovery, focusing on memorization, generalization, and contextual effects. Using openly accessible pre-training corpora (Dolma and ROOTS) together with real-world and synthetic datasets, the authors show that LLMs memorize frequent causal relations but struggle to generalize to novel or rare ones. They also demonstrate that incorrect causal relations in pre-training data can erode confidence in the corresponding correct predictions, and that contextual information substantially changes outcomes. The work suggests combining traditional statistical causal discovery with LLM-based approaches and highlights the need to curate pre-training data to minimize conflicting information. Both recommendations bear on the reliability of causal reasoning in AI systems.
Abstract
This study investigates the efficacy of Large Language Models (LLMs) in causal discovery. Using newly available open-source LLMs, OLMo and BLOOM, which provide access to their pre-training corpora, we investigate how LLMs address causal discovery through three research questions. We examine: (i) the impact of memorization on accurate causal relation prediction, (ii) the influence of incorrect causal relations in pre-training data, and (iii) the contextual nuances that influence LLMs' understanding of causal relations. Our findings indicate that while LLMs are effective in recognizing causal relations that occur frequently in pre-training data, their ability to generalize to new or rare causal relations is limited. Moreover, the presence of incorrect causal relations significantly undermines the confidence of LLMs in the corresponding correct causal relations, and contextual information critically affects the ability of LLMs to discern causal connections between random variables.
