Can large language models build causal graphs?

Stephanie Long; Tibor Schuster; Alexandre Piché

TL;DR

The paper investigates whether large language models can assist in building causal diagrams for medical contexts by evaluating GPT-3 on four ground-truth DAGs. It uses a pairwise edge-signaling approach, testing prompt engineering, linking verbs, and specificity to see how accurately the model can indicate the presence or absence of edges. Key contributions include quantifying how language choices affect edge-detection accuracy and highlighting the limitations of LLMs for causal graph construction without expert oversight. The findings suggest LLMs can speed up knowledge extraction from medical literature to complement expert-driven DAG development, pointing to a path toward scalable causal diagram construction with appropriate validation.

Abstract

Building causal graphs can be a laborious process. To ensure all relevant causal pathways have been captured, researchers often have to consult clinicians and experts while also reviewing extensive medical literature. By encoding common and medical knowledge, large language models (LLMs) represent an opportunity to ease this process by automatically scoring edges (i.e., connections between two variables) in potential graphs. LLMs, however, have been shown to be brittle to the choice of probing words, context, and prompts that the user employs. In this work, we evaluate whether LLMs can be a useful tool in complementing causal graph development.

Paper Structure (14 sections, 2 figures, 3 tables)

Introduction
Background
Large language models
Causal diagram overview
Experiments
Experimental details
Results
Q1: Does using prompt engineering lead to more accurate answers?
Q2: Does the verb used to denote the relationship between the variables have an impact on accuracy?
Q3: Does specificity in language improve accuracy?
Discussion
Limitations
Future work
Conclusion

Figures (2)

Figure 1: Overview of the evaluation. To predict the structure of a given causal graph, we scored two statements with GPT-3 for every ordered variable pair: the first implied the presence of an arrow and the second implied its absence. GPT-3 was counted as accurate if the correct statement had a higher accuracy score than the incorrect statement. For example, if an arrow was present in the true DAG, GPT-3 was accurate when the statement implying its presence scored higher than the statement implying its absence; if the arrow was absent, the reverse had to hold.
Figure 2: Ground Truth DAGs. Four DAGs illustrating well-known exposure-outcome effects in the medical literature. DAG (A) represents the simplest DAG evaluated by GPT-3. DAGs (B-D) represent more complex structures involving a collider variable (a node with two arrows pointing into it, e.g., 'respiratory disease', 'body weight', and 'heart failure') that shares a common cause with the outcome.
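
The pairwise evaluation described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the statement templates, the three-variable toy DAG, and the `toy_score` function are all assumptions made for the example. In the paper the scores come from GPT-3; here `score` is any callable that maps a statement to a number, so a deterministic stand-in is used instead.

```python
from itertools import permutations
from typing import Callable, Iterable, Set, Tuple

Edge = Tuple[str, str]

# Assumed statement templates; the exact wording used in the paper is not shown here.
PRESENT = "{a} causes {b}."
ABSENT = "{a} does not cause {b}."


def edge_accuracy(
    variables: Iterable[str],
    true_edges: Set[Edge],
    score: Callable[[str], float],
) -> float:
    """Fraction of ordered pairs where the correct statement outscores the incorrect one."""
    pairs = list(permutations(variables, 2))
    if not pairs:
        raise ValueError("need at least two variables")
    n_correct = 0
    for a, b in pairs:
        present = PRESENT.format(a=a, b=b)
        absent = ABSENT.format(a=a, b=b)
        if (a, b) in true_edges:
            correct, incorrect = present, absent
        else:
            correct, incorrect = absent, present
        # Strict inequality: a tie between the two statements counts as inaccurate.
        if score(correct) > score(incorrect):
            n_correct += 1
    return n_correct / len(pairs)


# Hypothetical toy DAG, chosen for illustration only (not one of the paper's four DAGs).
variables = ["smoking", "hypertension", "stroke"]
true_edges = {
    ("smoking", "hypertension"),
    ("hypertension", "stroke"),
    ("smoking", "stroke"),
}

# Stand-in for the language model: it "believes" only two of the three true edges.
believed = {
    PRESENT.format(a="smoking", b="hypertension"),
    PRESENT.format(a="hypertension", b="stroke"),
}


def toy_score(statement: str) -> float:
    """Deterministic placeholder for a model score (hypothetical)."""
    if statement in believed:
        return 1.0
    return 0.5 if "does not" in statement else 0.0


accuracy = edge_accuracy(variables, true_edges, toy_score)
print(f"accuracy = {accuracy:.3f}")  # 5 of 6 ordered pairs correct -> 0.833
```

Because every ordered pair is scored, the direction of an arrow matters: with three variables there are six pairs, and the toy scorer misses only the smoking-to-stroke edge. Swapping the templates for different linking verbs or more specific variable names is one way to probe the kind of prompt sensitivity the paper studies.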
