Can large language models build causal graphs?
Stephanie Long, Tibor Schuster, Alexandre Piché
TL;DR
The paper investigates whether large language models can assist in building causal diagrams for medical contexts by evaluating GPT-3 on four ground-truth DAGs. It uses a pairwise edge-signaling approach, varying the prompt design, the linking verb, and the specificity of the wording to measure how accurately the model indicates the presence or absence of each edge. Key contributions include quantifying how language choices affect edge-detection accuracy and highlighting the limitations of LLMs for causal graph construction without expert oversight. The findings suggest that LLMs can speed up knowledge extraction from medical literature and so complement expert-driven DAG development, pointing toward scalable causal diagram construction with appropriate validation.
Abstract
Building causal graphs can be a laborious process. To ensure that all relevant causal pathways have been captured, researchers often have to consult clinicians and other experts while also reviewing extensive medical literature. By encoding common and medical knowledge, large language models (LLMs) offer an opportunity to ease this process by automatically scoring edges (i.e., connections between two variables) in potential graphs. LLMs, however, have been shown to be brittle with respect to the choice of probing words, context, and prompts that the user employs. In this work, we evaluate whether LLMs can be a useful tool for complementing causal graph development.
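To make the idea of scoring edges pairwise more concrete, the following is a minimal sketch in Python, not the authors' implementation. It enumerates every ordered pair of variables, phrases each pair as a statement with a linking verb, and compares the resulting edge predictions with a ground-truth DAG. The variables, the edges, the linking verbs, the 0.5 threshold, and the `stub_score` function are all hypothetical placeholders chosen for illustration; in practice `stub_score` would be replaced by a score obtained from an LLM.

```python
from itertools import permutations

# Hypothetical toy ground-truth DAG (illustrative only, not from the paper).
VARIABLES = ["smoking", "tar deposits", "lung cancer"]
TRUE_EDGES = {("smoking", "tar deposits"), ("tar deposits", "lung cancer")}

# Hypothetical linking verbs; the verbs actually tested are not listed here.
LINKING_VERBS = ["causes", "increases the risk of"]


def build_prompts(variables, verbs):
    """Map (cause, effect, verb) to a statement for every ordered pair and verb."""
    return {
        (a, b, verb): f"{a} {verb} {b}."
        for a, b in permutations(variables, 2)
        for verb in verbs
    }


def stub_score(prompt):
    """Placeholder for an LLM-derived score in [0, 1]; NOT a real model call."""
    return 1.0 if prompt.startswith("smoking") else 0.0


def predict_edges(variables, verb, score_fn, threshold=0.5):
    """Keep the ordered pairs whose statement scores at or above the threshold."""
    prompts = build_prompts(variables, [verb])
    return {
        (a, b) for (a, b, _), text in prompts.items() if score_fn(text) >= threshold
    }


def edge_accuracy(predicted_edges, true_edges, variables):
    """Fraction of ordered pairs whose presence or absence is predicted correctly."""
    pairs = list(permutations(variables, 2))
    correct = sum((p in predicted_edges) == (p in true_edges) for p in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Repeating the evaluation per verb shows how wording alone can shift accuracy.
    for verb in LINKING_VERBS:
        predicted = predict_edges(VARIABLES, verb, stub_score)
        print(verb, round(edge_accuracy(predicted, TRUE_EDGES, VARIABLES), 3))
```

Because the stub ignores the linking verb, both verbs give the same accuracy here; with a real model, the per-verb comparison is where the brittleness to probing words would become visible.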
