Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation
George Whittle, Juliusz Ziomek, Jacob Rawling, Michael A Osborne
TL;DR
In many real-time settings, Bayesian inference is held back by intractable posteriors and by methods that cannot easily accommodate a change of prior. Distribution Transformers (DTs) address this by representing priors and posteriors as Gaussian Mixture Models (GMMs) and learning a transformer-based mapping from prior to posterior, conditioned on observations. This enables on-the-fly prior adaptation with approximate conjugacy. A single trained DT is amortized across a family of priors, and because the posterior has the same GMM form as the prior, it can serve as the prior for the next update, which suits sequential inference in filtering contexts. Empirical results on Gaussian Processes with hyperpriors, quantum-system parameter inference, and sequential sensor fusion show that DTs achieve competitive or superior log-likelihood performance while delivering substantial speedups over SVI, PFN, TabPFN, and ACE, making them practical for real-time uncertainty quantification with flexible priors.
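To make the idea of conjugacy with a GMM prior concrete, the sketch below shows the one special case where the prior-to-posterior map is available in closed form: a one-dimensional GMM prior over a parameter theta and a Gaussian likelihood with known noise variance. This is not the Distribution Transformer itself, which is trained to approximate such a map when no closed form exists; the function name `gmm_posterior` and all numerical values are illustrative assumptions.

```python
import numpy as np

def gmm_posterior(weights, means, variances, y, noise_var):
    """Exact update of a 1-D GMM prior over theta after observing y ~ N(theta, noise_var).

    Hypothetical helper for illustration only; it covers the exactly conjugate case.
    """
    weights, means, variances = (np.asarray(a, dtype=float)
                                 for a in (weights, means, variances))
    # Each component is updated like a single Gaussian prior.
    post_vars = 1.0 / (1.0 / variances + 1.0 / noise_var)
    post_means = post_vars * (means / variances + y / noise_var)
    # Component weights are rescaled by the marginal likelihood
    # N(y; mean_k, variance_k + noise_var), computed in log space for stability.
    evid_var = variances + noise_var
    log_evid = -0.5 * (np.log(2.0 * np.pi * evid_var) + (y - means) ** 2 / evid_var)
    log_w = np.log(weights) + log_evid
    log_w -= log_w.max()
    post_weights = np.exp(log_w)
    post_weights /= post_weights.sum()
    return post_weights, post_means, post_vars

# Sequential use: the posterior is again a GMM, so it becomes the next prior.
weights = np.array([0.5, 0.5])
means = np.array([-3.0, 3.0])
variances = np.array([1.0, 1.0])
for y in [2.4, 2.9, 3.1]:  # made-up observations
    weights, means, variances = gmm_posterior(weights, means, variances, y, noise_var=0.5)
```

Because the output has the same form as the input, the update can be chained indefinitely, which is the structural property that makes a GMM-to-GMM map convenient for filtering.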
Abstract
While Bayesian inference provides a principled framework for reasoning under uncertainty, its widespread adoption is limited by the intractability of exact posterior computation, which necessitates approximate inference. However, existing methods are often computationally expensive or demand costly retraining when the prior changes, which limits their utility, particularly in sequential inference problems such as real-time sensor fusion. To address these challenges, we introduce the Distribution Transformer, a novel architecture that can learn arbitrary distribution-to-distribution mappings. Our method can be trained to map a prior to the corresponding posterior, conditioned on some dataset, thus performing approximate Bayesian inference. The architecture represents a prior distribution as a (universally-approximating) Gaussian Mixture Model (GMM) and transforms it into a GMM representation of the posterior. The components of the GMM attend to each other via self-attention, and to the datapoints via cross-attention. We demonstrate that Distribution Transformers both retain the flexibility to vary the prior and significantly reduce computation times, from minutes to milliseconds, while achieving log-likelihood performance on par with or superior to existing approximate inference methods across tasks such as sequential inference, quantum system parameter inference, and Gaussian Process predictive posterior inference with hyperpriors.
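The attention pattern described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: each GMM component is encoded as one token, the tokens pass through one self-attention step and one cross-attention step over the data, and a linear readout produces the parameters of the output GMM. The embedding width, the token encoding (log weight, mean, log standard deviation), the single-head attention, and the random untrained weights are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head scaled dot-product attention with a residual connection."""
    q, k, v = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return queries + softmax(scores) @ v

d = 16        # hypothetical embedding width
K, N = 4, 10  # number of GMM components, number of datapoints

# Prior GMM over a scalar parameter: one row (token) per component.
prior = np.stack([np.log(np.full(K, 1.0 / K)),  # log mixture weights
                  np.linspace(-2.0, 2.0, K),    # component means
                  np.zeros(K)], axis=1)         # log standard deviations
data = rng.normal(size=(N, 2))                  # made-up (x, y) datapoints

# Random projections standing in for learned parameters (untrained).
W_in_comp = rng.normal(size=(3, d)) * 0.1
W_in_data = rng.normal(size=(2, d)) * 0.1
self_W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
cross_W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
W_out = rng.normal(size=(d, 3)) * 0.1

comp_tokens = prior @ W_in_comp
data_tokens = data @ W_in_data
comp_tokens = attention(comp_tokens, comp_tokens, *self_W)   # components attend to each other
comp_tokens = attention(comp_tokens, data_tokens, *cross_W)  # components attend to the datapoints

# Read out a valid GMM: normalized weights, unconstrained means, positive scales.
out = comp_tokens @ W_out
post_weights = softmax(out[:, 0], axis=0)
post_means = out[:, 1]
post_stds = np.exp(out[:, 2])
```

Since the weights here are random, the output is not a meaningful posterior; the point is only that the input and output are both sets of K component parameters, so the number of datapoints N can vary freely and the output can be fed back in as a prior.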
