Scalability of Bayesian Network Structure Elicitation with Large Language Models: a Novel Methodology and Comparative Analysis
Nikolay Babakov, Ehud Reiter, Alberto Bugarin
TL;DR
This work addresses the problem of eliciting Bayesian Network (BN) structures without data by leveraging multiple Large Language Models (LLMs) in a Delphi-style expert-aggregation framework. In the proposed method, an initial facilitator generates diverse LLM expert profiles, each expert independently reasons about possible causal edges, and the final structure is formed by majority voting, with cycle-resolution prompts to handle conflicts. The method is evaluated against a baseline Harness approach across BNs of varying sizes, together with a data-contamination test that assesses whether the LLMs have seen the target BN structures during training. Results show improvements over the baseline for one of the three studied LLMs (GPT-3.5) but reveal substantial scalability challenges as BN size grows; some BNs are also unsuitable for evaluation because of contamination or ambiguous node names. Overall, the study highlights the potential of LLM-driven BN elicitation while underscoring the need for input disambiguation, contamination checks, and larger context windows or domain-specialist prompting to achieve reliable scalability in structure learning.
Abstract
In this work, we propose a novel method for Bayesian Network (BN) structure elicitation that is based on initializing several LLMs with different experiences, independently querying them to create a structure of the BN, and then obtaining the final structure by majority voting. We compare the method with one alternative method on well-known and lesser-known BNs of different sizes and study the scalability of both methods on them. We also propose an approach to check the contamination of BNs in LLMs, which shows that some widely known BNs are inapplicable for testing LLM usage for BN structure elicitation. We also show that some BNs may be inapplicable for such experiments because their node names are indistinguishable. The experiments on the other BNs show that our method performs better than the existing method with one of the three studied LLMs; however, the performance of both methods decreases significantly as BN size increases.
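The aggregation step described above (independent expert proposals combined by majority voting, followed by cycle handling) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `majority_vote`, `find_cycle`, and `aggregate`, the `threshold` parameter, and the example edges are all hypothetical. In particular, the paper resolves cycles with LLM prompts; the sketch substitutes a simple stand-in that drops the edge with the fewest votes in each detected cycle.

```python
from collections import Counter


def majority_vote(expert_edges, threshold=0.5):
    """Keep directed edges proposed by more than `threshold` of the experts."""
    # Count each edge at most once per expert.
    counts = Counter(edge for edges in expert_edges for edge in set(edges))
    needed = threshold * len(expert_edges)
    return {edge: n for edge, n in counts.items() if n > needed}


def find_cycle(edges):
    """Return one directed cycle as a list of edges, or None if acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state = {}  # node -> 1 while on the DFS stack, 2 once finished
    path = []

    def visit(node):
        state[node] = 1
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == 1:  # back edge closes a cycle
                cyc = path[path.index(nxt):] + [nxt]
                return list(zip(cyc, cyc[1:]))
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def aggregate(expert_edges, threshold=0.5):
    """Majority-vote the edges, then break cycles until the graph is a DAG.

    Assumption: the weakest edge of each cycle is dropped. This is a
    stand-in for the paper's LLM-based cycle-resolution prompts.
    """
    votes = majority_vote(expert_edges, threshold)
    while (cycle := find_cycle(votes)) is not None:
        weakest = min(cycle, key=lambda e: (votes[e], e))
        del votes[weakest]
    return set(votes)


# Hypothetical proposals from three LLM "experts".
experts = [
    [("A", "B"), ("B", "C"), ("C", "A")],
    [("A", "B"), ("B", "C"), ("C", "A")],
    [("A", "B"), ("B", "C"), ("D", "A")],
]
final_edges = aggregate(experts)
```

In this example, the edge `("D", "A")` is discarded because only one of three experts proposed it, and `("C", "A")` is removed because it closes a cycle and has fewer votes than the other two edges in that cycle.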
