Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models

Yuzhe Zhang, Yipeng Zhang, Yidong Gan, Lina Yao, Chen Wang

TL;DR

The paper addresses causal graph recovery from observational data under data biases by introducing LACR, a retrieval-augmented generation framework that leverages large language models and scientific literature to extract associative evidence and verify causality. It follows a constraint-based paradigm, using a two-phase process—edge existence verification (LACR1) and orientation (LACR2)—to construct a skeleton from a knowledge base, with aggregation across multiple KBs guided by the Wisdom of the Crowd. The approach demonstrates improved skeleton quality and orientation results on real-world graphs (ASIA, SACHS, CORONARY) and highlights how up-to-date literature can refine ground-truth graphs, revealing gaps between standard benchmarks and current domain knowledge. Limitations include the quality of retrieved documents, the domain understanding of LLMs, and computational costs, pointing to future work in improved search, domain-adapted models, and broader, open-access corpora.
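The aggregation step can be pictured as a vote over per-KB verdicts. The sketch below is a minimal illustration that assumes a strict-majority rule; this summary does not state the paper's actual aggregation rule, and `aggregate_edge_votes` and the sample verdicts are hypothetical.

```python
from collections import Counter

def aggregate_edge_votes(votes_per_kb):
    """Combine per-KB edge verdicts by strict majority (an assumed rule).

    votes_per_kb: list of dicts mapping an unordered pair (frozenset of two
    variable names) to True (edge judged present) or False (judged absent).
    Returns the set of pairs accepted by more than half of the KBs that voted.
    """
    yes = Counter()
    total = Counter()
    for votes in votes_per_kb:
        for pair, present in votes.items():
            total[pair] += 1
            if present:
                yes[pair] += 1
    return {pair for pair in total if 2 * yes[pair] > total[pair]}

# Hypothetical verdicts from three knowledge bases (illustrative only).
smoke_cancer = frozenset({"smoking", "lung cancer"})
asia_cancer = frozenset({"visit to Asia", "lung cancer"})
kb_votes = [
    {smoke_cancer: True, asia_cancer: False},
    {smoke_cancer: True, asia_cancer: True},
    {smoke_cancer: True, asia_cancer: False},
]
skeleton = aggregate_edge_votes(kb_votes)
```

Here only the pair accepted by a majority of the three hypothetical KBs remains in `skeleton`.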

Abstract

Causal graph recovery is traditionally done using statistical estimation-based methods or based on individuals' knowledge about the variables of interest. These approaches often suffer from data collection biases and the limitations of individuals' knowledge. Advances in large language models (LLMs) provide opportunities to address these problems. We propose a novel method that leverages LLMs to deduce causal relationships in general causal graph recovery tasks. The method draws on knowledge compressed in LLMs, knowledge that LLMs extract from scientific publication databases, and experimental data about the factors of interest. It provides a prompting strategy to extract associational relationships among those factors and a mechanism to perform causality verification for these associations. Compared to other LLM-based methods that directly instruct LLMs to do the highly complex causal reasoning, our method shows a clear advantage in causal graph quality on benchmark datasets. More importantly, as causality among some factors may change as new research results emerge, our method shows sensitivity to new evidence in the literature and can provide useful information for updating causal graphs accordingly.

Paper Structure (39 sections, 2 theorems, 5 equations, 8 figures, 3 tables, 1 algorithm)

This paper contains 39 sections, 2 theorems, 5 equations, 8 figures, 3 tables, and 1 algorithm.

Key Result

Proposition 3.1

Assuming that estimating $\hat{\alpha}_{\mathtt{KB}}(ij\mid V')$ for a given $V'$ takes $O(1)$ time, inferring $\zeta_{\mathtt{KB}}(ij)$ requires $O(2^{n-2})$ time, where $n=|V|$.
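
The exponential bound matches the number of candidate conditioning sets: if $V'$ ranges over the subsets of $V\setminus\{i,j\}$, there are $2^{n-2}$ of them. The following is a minimal sketch, assuming (this summary does not state it) that an edge is kept only when the association survives every conditioning set, as in constraint-based methods such as PC. The names `conditioning_sets`, `infer_edge`, and `alpha_hat` are hypothetical; `alpha_hat` stands in for the $O(1)$ estimator.

```python
from itertools import chain, combinations

def conditioning_sets(variables, i, j):
    """Return an iterator over every subset V' of variables minus {i, j}."""
    rest = [v for v in variables if v not in (i, j)]
    return chain.from_iterable(
        combinations(rest, k) for k in range(len(rest) + 1)
    )

def infer_edge(variables, i, j, alpha_hat):
    """Assumed rule: keep edge i-j only if alpha_hat reports an association
    under every conditioning set. Returns (edge_kept, number_of_calls)."""
    calls = 0
    for subset in conditioning_sets(variables, i, j):
        calls += 1
        if not alpha_hat(i, j, frozenset(subset)):
            return False, calls  # some V' separates i and j, so stop early
    return True, calls

def always_associated(i, j, cond):
    """Hypothetical worst-case oracle: never finds a separating set."""
    return True

variables = ["A", "B", "C", "D", "E"]  # n = 5
edge, calls = infer_edge(variables, "A", "B", always_associated)
```

With $n=5$ the worst case makes $2^{3}=8$ estimator calls, one per subset of the three remaining variables.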

Figures (8)

  • Figure 1: Causal graphs in Example \ref{ex:bias}: left, the true causal graph; right, the causal graph recovered from the biased data.
  • Figure 2: PC algorithm's process.
  • Figure 3: Ground truth causal graph of ASIA from Lauritzen and Spiegelhalter (1988).
  • Figure 4: Refined ground truth causal graph of ASIA by LACR.
  • Figure 5: Original ground truth causal graph of CORONARY from Reinis et al. (1981).
  • ...and 3 more figures

Theorems & Definitions (9)

  • Definition 2.2: d-separation
  • Definition 2.4: Back-door criterion
  • Example 2.5
  • Proposition 3.1
  • proof
  • Proposition 3.2
  • proof
  • Example A.1
  • Example A.2