Martingale Posterior Neural Networks for Fast Sequential Decision Making
Gerardo Duran-Martin, Leandro Sánchez-Betancourt, Álvaro Cartea, Kevin Murphy
TL;DR
The paper addresses fast, uncertainty-aware sequential decision making with neural networks. It proposes a predictive-first framework based on martingale posteriors: the one-step-ahead predictive p_{b_t}(y_{t+1}|x_{t+1}) is updated via Kalman-filter-like recursions, which decouples decision making from parameter-space inference. The authors introduce three scalable covariance strategies (HiLoFi, LoLoFi, and LRKF) that enable online, replay-free updates for million-parameter networks, and they report 10–100x faster inference than classical Thompson sampling with competitive or superior performance on non-stationary contextual bandits and Bayesian optimization (BO). The approach scales to large architectures, maintains principled uncertainty quantification through the posterior predictive, and offers practical benefits for online decision-making tasks such as recommender systems and BO. Limitations include sensitivity to linearization, the absence of fixed-lag smoothing, and dependence on hyperparameter choices; future work targets fully online reinforcement learning (RL) extensions.
Abstract
We introduce scalable algorithms for online learning of neural network parameters and Bayesian sequential decision making. Unlike classical Bayesian neural networks, which induce predictive uncertainty through a posterior over model parameters, our methods adopt a predictive-first perspective based on martingale posteriors. In particular, we work directly with the one-step-ahead posterior predictive, which we parameterize with a neural network and update sequentially with incoming observations. This decouples Bayesian decision-making from parameter-space inference: we sample from the posterior predictive for decision making, and update the parameters of the posterior predictive via fast, frequentist Kalman-filter-like recursions. Our algorithms operate in a fully online, replay-free setting, providing principled uncertainty quantification without costly posterior sampling. Empirically, they achieve competitive performance-speed trade-offs in non-stationary contextual bandits and Bayesian optimization, offering 10-100 times faster inference than classical Thompson sampling while maintaining comparable or superior decision performance.
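
The idea of updating a predictive with Kalman-filter-like recursions and then sampling from it can be illustrated with a minimal sketch. This is not the authors' HiLoFi, LoLoFi, or LRKF method: it assumes a toy one-neuron model, a dense covariance matrix (which would not scale to million-parameter networks), a scalar observation, and a fixed observation variance. All function names and the `obs_var` parameter are hypothetical and chosen for illustration only.

```python
import math
import random


def predict(theta, x):
    """Toy one-neuron 'network' (an assumption): f(theta, x) = tanh(theta . x)."""
    return math.tanh(sum(t * xi for t, xi in zip(theta, x)))


def jacobian(theta, x):
    """Gradient of f with respect to theta, used to linearize the model."""
    f = predict(theta, x)
    return [(1.0 - f * f) * xi for xi in x]


def _cov_times_jac(P, H):
    """Compute the vector P @ H^T for a dense covariance P and gradient H."""
    return [sum(p_ij * h_j for p_ij, h_j in zip(row, H)) for row in P]


def predictive_variance(theta, P, x, obs_var=0.1):
    """Variance of the linearized Gaussian one-step-ahead predictive."""
    H = jacobian(theta, x)
    PH = _cov_times_jac(P, H)
    return sum(h * ph for h, ph in zip(H, PH)) + obs_var


def ekf_update(theta, P, x, y, obs_var=0.1):
    """One online, replay-free update from a single observation (x, y)."""
    H = jacobian(theta, x)
    PH = _cov_times_jac(P, H)
    S = sum(h * ph for h, ph in zip(H, PH)) + obs_var  # innovation variance
    K = [ph / S for ph in PH]                           # Kalman gain
    err = y - predict(theta, x)                         # one-step-ahead error
    theta_new = [t + k * err for t, k in zip(theta, K)]
    # P - K S K^T, written as P - K (P H^T)^T because K S = P H^T.
    d = len(theta)
    P_new = [[P[i][j] - K[i] * PH[j] for j in range(d)] for i in range(d)]
    return theta_new, P_new


def sample_predictive(theta, P, x, obs_var=0.1, rng=random):
    """Draw one outcome from the linearized Gaussian predictive for context x."""
    var = predictive_variance(theta, P, x, obs_var)
    return rng.gauss(predict(theta, x), math.sqrt(var))


def demo(steps=200, seed=0):
    """Two-armed toy bandit: pick the arm with the larger predictive sample."""
    rng = random.Random(seed)
    true_theta = [0.8, -0.5]           # hypothetical ground truth
    arms = [[1.0, 0.0], [0.0, 1.0]]    # one fixed context per arm
    theta, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        draws = [sample_predictive(theta, P, x, rng=rng) for x in arms]
        x = arms[draws.index(max(draws))]
        y = predict(true_theta, x) + rng.gauss(0.0, 0.1)
        theta, P = ekf_update(theta, P, x, y)
    return theta, P


if __name__ == "__main__":
    print(demo())
```

In this sketch the decision step only calls `sample_predictive`, while learning happens in `ekf_update`, which mirrors the separation between decision making and parameter updates described in the abstract. Whether the sampled quantity should include the observation noise term is a modelling choice made here for simplicity.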
