Large Language Bayes

Justin Domke

TL;DR

This work introduces Large Language Bayes (LLB), a framework that combines probabilistic models generated by a large language model with probabilistic programming to define a joint distribution over models, data, and latent variables. By marginalizing over the generated models and combining per-model posteriors with data-driven weights derived from approximate marginal likelihoods, it yields an interpretable form of Bayesian model averaging driven by user text and data. The authors provide a practical inference recipe based on self-normalized importance sampling (SNIS) and variational bounds, and validate it on the Rain, Coin, Polling, City Temperature, and Gold problems, showing that the approach often improves over naive flat ensembles and captures user intent. Theoretical analysis connects SNIS weights, ELBOs, and joint divergences, and the paper discusses limitations and future directions such as better model priors and scalable inference. Overall, the work demonstrates a principled path to turning informal problem descriptions into calibrated probabilistic predictions without committing to a single formal model.

Abstract

Many domain experts do not have the time or expertise to write formal Bayesian models. This paper takes an informal problem description as input, and combines a large language model and a probabilistic programming language to define a joint distribution over formal models, latent variables, and data. A posterior over latent variables follows by conditioning on observed data and integrating over formal models. This presents a challenging inference problem. We suggest an inference recipe that amounts to generating many formal models from the large language model, performing approximate inference on each, and then doing a weighted average. This is justified and analyzed as a combination of self-normalized importance sampling, MCMC, and importance-weighted variational inference. Experimentally, this produces sensible predictions from only data and an informal problem description, without the need to specify a formal model.
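The final step of the recipe, the weighted average, can be illustrated with a minimal sketch. It assumes each generated model has already been given an estimated log marginal likelihood and a posterior mean by some separate inference routine; the function names and the numbers below are hypothetical and are not taken from the paper. Weights are taken proportional to the marginal likelihood, as in self-normalized importance sampling, and are computed in log space for numerical stability.

```python
import math

def normalized_weights(log_marginals):
    """Return weights proportional to exp(log_marginals), summing to 1.

    Subtracting the maximum before exponentiating avoids underflow when
    the log marginal likelihoods are large and negative.
    """
    top = max(log_marginals)
    unnorm = [math.exp(v - top) for v in log_marginals]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def combine(posterior_means, log_marginals):
    """Average per-model posterior means with marginal-likelihood weights."""
    weights = normalized_weights(log_marginals)
    return sum(w * mu for w, mu in zip(weights, posterior_means))

def flat_average(posterior_means):
    """Baseline that ignores the data fit and weights every model equally."""
    return sum(posterior_means) / len(posterior_means)

# Hypothetical values for four generated models (illustration only).
log_marginals = [-12.0, -10.5, -30.0, -11.0]
posterior_means = [0.31, 0.35, 0.90, 0.33]

weighted = combine(posterior_means, log_marginals)
flat = flat_average(posterior_means)
print(f"weighted: {weighted:.3f}  flat: {flat:.3f}")
```

In this toy input the third model fits the data far worse than the others, so it receives almost no weight in the weighted average while still pulling the flat average upward.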

Paper Structure

This paper contains 51 sections, 5 theorems, 59 equations, 47 figures, 9 algorithms.

Key Result

Theorem 1

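Suppose $p(z,x,m\vert t)$ and $q(z\vert x,m)$ are fixed. Then $\operatorname{KL}\left(q(\mathsf{z},\mathsf{m}\vert x)\middle\Vert p(\mathsf{z},\mathsf{m}\vert x,t)\right)$ is minimized by $q(m\vert x)\propto p(m\vert x,t)\,\exp\left(-\operatorname{KL}\left(q(\mathsf{z}\vert x,m)\middle\Vert p(\mathsf{z}\vert x,m)\right)\right)$, with a resulting joint divergence of $-\log \mathbb{E}_{p(\mathsf{m}\vert x,t)}\exp\left(-\operatorname{KL}\left(q(\mathsf{z}\vert x,\mathsf{m})\middle\Vert p(\mathsf{z}\vert x,\mathsf{m})\right)\right)$.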
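A sketch of why a result of this kind holds, under two assumptions that are consistent with the setup in the abstract but are not quoted from the paper: the approximation factorizes as $q(z,m\vert x)=q(m\vert x)\,q(z\vert x,m)$, and the text $t$ affects $z$ only through the model, so $p(z\vert x,m,t)=p(z\vert x,m)$. The shorthand $D_m$ is introduced here for illustration only.

$$D_m=\operatorname{KL}\left(q(\mathsf{z}\vert x,m)\middle\Vert p(\mathsf{z}\vert x,m)\right)$$

$$\operatorname{KL}\left(q(\mathsf{z},\mathsf{m}\vert x)\middle\Vert p(\mathsf{z},\mathsf{m}\vert x,t)\right)=\operatorname{KL}\left(q(\mathsf{m}\vert x)\middle\Vert p(\mathsf{m}\vert x,t)\right)+\mathbb{E}_{q(\mathsf{m}\vert x)}\left[D_{\mathsf{m}}\right]$$

The first line of this decomposition is the chain rule for KL divergence. Minimizing the right-hand side over distributions $q(m\vert x)$ is a standard Gibbs-type variational problem, whose solution and optimal value are

$$q(m\vert x)\propto p(m\vert x,t)\,e^{-D_m},\qquad \min_{q(m\vert x)}\operatorname{KL}=-\log\mathbb{E}_{p(\mathsf{m}\vert x,t)}\left[e^{-D_{\mathsf{m}}}\right].$$

Since $p(m\vert x,t)\propto p(m\vert t)\,p(x\vert m)$ and $\log p(x\vert m)-D_m$ is the evidence lower bound (ELBO) for model $m$, the optimal weights can equivalently be read as proportional to $p(m\vert t)\exp(\mathrm{ELBO}_m)$. When every per-model approximation is exact, $D_m=0$ and the weights reduce to the exact model posterior $p(m\vert x,t)$.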

Figures (47)

  • Figure 1: The basic idea. Given informal user text, an LLM generates a set of candidate formal models. Inference is performed on each and the posteriors are combined with weight proportional to the marginal likelihood. Here four (real) LLM-generated formal models in the Stan language are shown in different colors, with corresponding colors for marginal likelihoods and posteriors.
  • Figure 2: The rain problem. Left: Informal user text $t$. Top right: The given data $x$. Bottom center: Estimated marginal likelihoods $p(x\vert m^{(n)})$ and posterior means $\mathbb{E}[\mathsf{z}\vert x,m^{(n)}]$ for each generated model $m^{(n)}$. Markers for the four models in Figure 1 are colored. Bottom right: The final posterior mean, compared to a flat average.
  • Figure 3: The coin problem. Left: Snippets from three different user prompts. Right: Resulting final posteriors, which appear to reflect user intent. Predictions from $p_{\mathrm{flat}}$ are shown as faint dotted lines.
  • Figure 4: The polling problem. Left: Observed data $x$. Right: Final estimated posterior, compared to a flat average.
  • Figure 5: Medians and 90% credible intervals for each city and test day for the city temperature problem.
  • ...and 42 more figures

Theorems & Definitions (10)

  • Theorem 1
  • Corollary 2
  • Theorem 3
  • Corollary 4
  • proof
  • proof
  • proof
  • proof
  • Theorem 5
  • proof