BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models

Yu Feng, Ben Zhou, Weidong Lin, Dan Roth

TL;DR

BIRD tackles unreliable probability estimation in large language models under partial information by coupling abductive factor generation with a Bayesian network, then refining conditional probabilities through constrained optimization and LLM entailment to compute reliable P(O|C). The framework produces interpretable, language-based Bayesian variables and demonstrates significant improvements in probability calibration and decision-making across diverse reasoning tasks. Extrinsic evaluation shows that BIRD-derived probabilities can serve as supervision signals to improve smaller models, while follow-up-question generation highlights its utility for trust-aware, interactive decision support. Overall, BIRD advances trustworthy AI by providing a transparent, data-efficient approach to probabilistic inference in LLM-driven applications.

Abstract

Predictive models often need to work with incomplete information in real-world tasks. Consequently, they must provide reliable probability or confidence estimation, especially in large-scale decision-making and planning tasks. Current large language models (LLMs) are insufficient for accurate estimations, but they can generate relevant factors that may affect the probabilities, produce coarse-grained probabilities when the information is more complete, and help determine which factors are relevant to specific downstream contexts. In this paper, we make use of these capabilities of LLMs to provide a significantly more accurate probabilistic estimation. We propose BIRD, a novel probabilistic inference framework that aligns a Bayesian network with LLM abductions and then estimates more accurate probabilities in a deduction step. We show BIRD provides reliable probability estimations that are 30% better than those provided directly by LLM baselines. These estimates further contribute to better and more trustworthy decision making.

Paper Structure

This paper contains 26 sections, 11 equations, 15 figures, 9 tables, 1 algorithm.

Figures (15)

  • Figure 1: An example of a mission-critical task. We first ask OpenAI o1 to separately predict the probability of building a charging station at each specific location. In two cases it assigns the same probability to two different conditions, while Bird (ours) distinguishes between them and can thus help the user make a more informed decision. This demonstrates that while LLMs are capable of coarse estimation, they struggle to generate accurate probabilities. We further show that although OpenAI o1 can rank all four locations by how likely each is to be chosen for a new charging station, the ranking is still insufficient to complete the task because of ties: locations (1,4) and (2,3) are indistinguishable from a ranking perspective.
  • Figure 2: Overview of Bird. Given a scenario $S$, LLMs generate the factors $F$ (a/b/...) with potential values ($f_1 \in$ {a.1,a.2}). Bird approximates a Bayesian network parameterized by $\mathbb{P}(O_i|f_j)$, and optimizes by sampling LLM coarse predictions on $\mathbb{P}_\mathrm{LLM}(O_i|f), f \in \mathcal{F}$ ($\mathcal{F}$ is the set of all value combinations of $F$, e.g., $f={\rm ( a.1,b.2,c.1,d.2,e.1,f.2)}$), and minimizing the distributional distance between an approximated $\mathbb{P}_\mathrm{estimated}(O_i|f)$ and $\mathbb{P}_\mathrm{LLM}(O_i|f)$. At inference time, each context $C$ (S+U1/S+U2/...) is mapped to some $f_j$ via entailment, and a probability is derived using $\mathbb{P}_\mathrm{estimated}$. Bird can further generate follow-up questions.
  • Figure 3: The interface for human evaluation on preference-based pairwise evaluation of the estimated probabilities.
  • Figure 4: The interface for human evaluation on preference-based pairwise evaluation of the generated questions. Question 1 is generated by Bird and question 2 is generated by LLM directly. The three annotators all prefer question 1.
  • Figure 5: Example Additional Sentence Sampling Prompt
  • ...and 10 more figures
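
The optimization and inference steps described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation; it rests on several assumptions that the caption does not state: a binary outcome, a normalized-product rule for combining the per-factor conditionals $\mathbb{P}(O_i|f_j)$ into $\mathbb{P}_\mathrm{estimated}(O_i|f)$, mean squared error as the distributional distance, and uniform averaging over factors that a context does not entail. The names `estimate`, `fit`, and `infer`, and all numbers in the demo, are hypothetical.

```python
import math
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def estimate(theta, combo):
    """P_estimated(O_1 | f) for a full value combination f.

    ASSUMPTION: per-factor conditionals P(O_1 | f_j) = sigmoid(theta[j][v])
    are combined by a normalized product, which for a binary outcome
    equals a sigmoid over the summed logits.
    """
    return sigmoid(sum(theta[j][v] for j, v in enumerate(combo)))


def fit(factor_sizes, llm_probs, steps=5000, lr=0.5):
    """Gradient descent on the mean squared distance to P_LLM(O_1 | f).

    ASSUMPTION: squared error stands in for the distributional distance.
    """
    theta = [[0.0] * n for n in factor_sizes]  # start at P(O_1 | f_j) = 0.5
    for _ in range(steps):
        grad = [[0.0] * n for n in factor_sizes]
        for combo, target in llm_probs.items():
            s = estimate(theta, combo)
            # d/d(logit) of (s - target)^2, averaged over all combinations
            g = 2.0 * (s - target) * s * (1.0 - s) / len(llm_probs)
            for j, v in enumerate(combo):
                grad[j][v] += g
        for j, row in enumerate(grad):
            for v, g in enumerate(row):
                theta[j][v] -= lr * g
    return theta


def infer(theta, factor_sizes, observed):
    """P_estimated(O_1 | C) when context C entails only some factor values.

    `observed` maps factor index -> entailed value index. ASSUMPTION:
    factors the context does not entail are averaged uniformly.
    """
    choices = [[observed[j]] if j in observed else range(n)
               for j, n in enumerate(factor_sizes)]
    combos = list(product(*choices))
    return sum(estimate(theta, c) for c in combos) / len(combos)


if __name__ == "__main__":
    sizes = [2, 2, 2]  # three hypothetical factors (a, b, c), two values each
    # Hypothetical coarse LLM predictions P_LLM(O_1 | f) for every combination.
    coarse = {
        (0, 0, 0): 0.9, (0, 0, 1): 0.8, (0, 1, 0): 0.7, (0, 1, 1): 0.6,
        (1, 0, 0): 0.5, (1, 0, 1): 0.3, (1, 1, 0): 0.3, (1, 1, 1): 0.1,
    }
    theta = fit(sizes, coarse)
    # Two contexts that entail different values of factor a only.
    print(infer(theta, sizes, {0: 0}), infer(theta, sizes, {0: 1}))
```

Working in logit space keeps every per-factor probability strictly between 0 and 1 without explicit constraints; a constrained optimizer over the probabilities themselves would be an equally reasonable choice under the same assumptions.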