Large Language Models are Effective Priors for Causal Graph Discovery

Victor-Alexandru Darvariu, Stephen Hailes, Mirco Musolesi

TL;DR

This work investigates the use of Large Language Models (LLMs) as soft priors to guide causal graph discovery when data is limited. It introduces a probabilistic model of expert interaction, a set of metrics for evaluating priors, and a flexible framework for integrating priors with CD-UCT. Empirically, it shows that 3-Way prompting and the combination of mutual information (MI) priors with LLM priors yield improved causal graphs, especially when computational budgets are tight. The main contributions are (i) a principled way to quantify and compare LLM-derived priors for causal discovery, (ii) a prompting design space that shows when LLMs provide reliable directionality signals, and (iii) a practical integration strategy that outperforms hard priors and baseline MI priors on common-sense benchmarks. The findings highlight the potential of LLMs as soft background knowledge for causal structure learning while acknowledging limitations related to domain-specific knowledge, scalability, and possible data leakage, and they point to future work on richer interactive setups and domain-tailored LLMs.

Abstract

Causal structure discovery from observations can be improved by integrating background knowledge provided by an expert to reduce the hypothesis space. Recently, Large Language Models (LLMs) have begun to be considered as sources of prior information given the low cost of querying them relative to a human expert. In this work, firstly, we propose a set of metrics for assessing LLM judgments for causal graph discovery independently of the downstream algorithm. Secondly, we systematically study a set of prompting designs that allows the model to specify priors about the structure of the causal graph. Finally, we present a general methodology for the integration of LLM priors in graph discovery algorithms, finding that they help improve performance on common-sense benchmarks and especially when used for assessing edge directionality. Our work highlights the potential as well as the shortcomings of the use of LLMs in this problem space.
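
As an illustration of how LLM judgments could be turned into a prior over directed edges, the following is a minimal sketch rather than the authors' implementation. It assumes that a 3-way judgment for a variable pair assigns non-negative scores to "a causes b", "b causes a", and "no direct relationship". The function name `three_way_to_edge_prior`, the `judgments` format, and the uniform fallback for unqueried pairs are all hypothetical choices made for this example.

```python
from itertools import combinations


def three_way_to_edge_prior(variables, judgments):
    """Convert per-pair 3-way scores into probabilities over directed edges.

    `judgments` maps a pair (a, b) to a dict with the hypothetical keys
    "a->b", "b->a", and "none". Scores are normalized per pair, so the
    two directions and the "no edge" option share one unit of mass.
    """
    prior = {}
    for a, b in combinations(variables, 2):
        scores = judgments.get((a, b))
        if scores is None:
            # Assumption: a pair without a judgment gets a uniform split.
            scores = {"a->b": 1.0, "b->a": 1.0, "none": 1.0}
        total = scores["a->b"] + scores["b->a"] + scores["none"]
        if total <= 0:
            raise ValueError(f"scores for pair ({a}, {b}) must sum to a positive value")
        prior[(a, b)] = scores["a->b"] / total
        prior[(b, a)] = scores["b->a"] / total
    return prior


# Usage with made-up scores (not taken from the paper).
example_judgments = {
    ("smoking", "cancer"): {"a->b": 0.7, "b->a": 0.1, "none": 0.2},
}
example_prior = three_way_to_edge_prior(
    ["smoking", "cancer", "xray"], example_judgments
)
```

Normalizing per pair keeps the prior soft: a direction that the model considers unlikely receives a small probability instead of being excluded outright, which is the distinction between soft and hard priors drawn in the TL;DR.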

Paper Structure

This paper contains 17 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: High-level summary of the methodology and contributions of the present work. Firstly, we formulate a probabilistic model of expert interaction for causal graph discovery and propose a set of metrics for assessing the quality of causal judgments supplied by LLMs. Secondly, we conduct an evaluation of several LLM architectures and prompt design choices using these metrics, showing that LLM background knowledge can convincingly outperform a null model on benchmarks requiring common-sense reasoning. Lastly, we integrate LLM-derived knowledge with a recent causal discovery method, finding that it is most beneficial for assessing the more likely direction of a relationship and in scenarios in which computational budgets are low.
  • Figure 2: Results for the metrics defined over probabilistic LLM judgments. The red lines indicate the values that would be obtained by UR priors. In most cases, LLMs show better results than those obtained with UR, and hence can serve as useful priors for causal discovery. LLM performance is weaker on the Child dataset, which requires specialist domain knowledge. LLaMA models yield better TERE and TENE values at the expense of a higher LOD.
  • Figure 3: Results obtained using CD-UCT with various priors. The leftmost two and rightmost two columns show the results with computational budget multipliers of $1$ and $100$, respectively. MI and MI$\odot$LLM priors outperform the default UR priors over edges, while LLM priors by themselves do not always do so. The benefit of using priors is more pronounced with lower simulation budgets. An intermediate level of trust in the prior tends to result in the best outcome.
  • Figure 4: Metrics results for larger LLaMA models on the Asia dataset. While some improvements are exhibited, models with higher parameter counts do not lead to substantially better performance, echoing results on common-sense reasoning benchmarks.
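
The MI$\odot$LLM prior and the notion of trust in the prior mentioned in the Figure 3 caption can be illustrated with a small sketch. This is not the paper's formulation, which this summary does not give. It assumes that $\odot$ denotes an element-wise product of two edge priors, and that trust acts as a tempering exponent in $[0, 1]$, where $0$ recovers a uniform prior and $1$ uses the product as is. The function name `combine_priors` and the `trust` parameter are hypothetical.

```python
def combine_priors(mi_prior, llm_prior, trust=0.5):
    """Element-wise product of two edge priors, tempered and renormalized.

    Both priors map directed edges (src, dst) to non-negative weights.
    Assumption: `trust` is an exponent in [0, 1]; the excerpt does not say
    how the level of trust is actually parameterized.
    """
    if mi_prior.keys() != llm_prior.keys():
        raise ValueError("both priors must be defined over the same edges")
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    # Multiply per edge, then flatten toward uniform as trust decreases.
    weights = {
        edge: (mi_prior[edge] * llm_prior[edge]) ** trust for edge in mi_prior
    }
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("combined prior has no probability mass")
    return {edge: weight / total for edge, weight in weights.items()}


# Usage with made-up values: MI is symmetric and cannot orient the edge,
# so the LLM prior supplies the directional preference.
mi = {("A", "B"): 0.5, ("B", "A"): 0.5}
llm = {("A", "B"): 0.8, ("B", "A"): 0.2}
combined = combine_priors(mi, llm, trust=1.0)
```

Mutual information is symmetric in its two arguments, so an MI-based prior alone gives both directions of a pair the same weight. Under the assumptions above, the product lets the LLM judgment break that tie, which is consistent with the caption's observation that LLM priors are most useful for edge directionality.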