Rapid Bayesian Computation and Estimation for Neural Networks via Log-Concave Coupling
Curtis McDonald, Andrew R. Barron
TL;DR
This paper develops a Bayesian framework for single-hidden-layer neural networks with $Kd$ interior weights and fixed outer weights, aiming for rapid, polynomial-time sampling. It introduces a log-concave coupling that represents the posterior as a mixture over an auxiliary Gaussian variable $\xi$, so that sampling reduces to drawing from log-concave components, under suitable scaling where $Kd\ge A_{3}(\beta N)^{2}$. For a continuous uniform prior, the posterior admits this log-concave mixture form. For a discrete uniform prior, the authors derive two risk bounds: an arbitrary-sequence squared-regret bound of order $O(((\log d)/N)^{1/4})$ and a Kullback risk bound of order $O(((\log d)/N)^{1/3})$ under Gaussian data with $\beta=1/\sigma^{2}$. The work connects sampling efficiency with statistical risk control, outlining a pathway to polynomial-time Bayesian training of neural networks while acknowledging the gap between the sampling guarantees for the continuous prior and the risk guarantees for the discrete prior. These results offer a principled framework for balancing computation and statistical performance in Bayesian neural networks, and they point to future work on combining the continuous and discrete priors for scalable Bayesian training.
Abstract
This paper studies a Bayesian estimation procedure for single-hidden-layer neural networks using $\ell_{1}$ controlled weights. We study the structure of the posterior density and provide a representation that makes it amenable to rapid sampling via Markov Chain Monte Carlo (MCMC) and to statistical risk guarantees. The neural network has $K$ neurons and internal weight dimension $d$, and the outer weights are fixed, giving $Kd$ parameters overall. With $N$ data observations, we use a gain parameter $\beta$ in the posterior density. The posterior is multimodal and not naturally suited to rapid mixing of direct MCMC algorithms. For a continuous uniform prior on the $\ell_{1}$ ball, we show that the posterior density can be written as a mixture density with suitably defined auxiliary random variables, where the mixture components are log-concave. Furthermore, when the number of model parameters $Kd$ is large enough that $Kd \geq C(\beta N)^{2}$, the mixing distribution of the auxiliary random variables is also log-concave. Thus, neuron parameters can be sampled from the posterior by sampling only log-concave densities. We refer to this mixture representation as a log-concave coupling. For a discrete uniform prior restricted to a grid, we study the statistical risk (generalization error) of procedures based on the posterior. Using a gain of $\beta = C[(\log d)/N]^{1/4}$, we show that the squared error is of order $O([(\log d)/N]^{1/4})$. Using independent Gaussian data with a variance $\sigma^{2}$ that matches the inverse gain, $\beta = 1/\sigma^{2}$, we show that the expected Kullback divergence is of order $O([(\log d)/N]^{1/3})$, a cube-root rate. Future work aims to bridge the sampling ability of the continuous uniform prior with the risk control of the discrete uniform prior, resulting in a polynomial-time Bayesian training algorithm for neural networks with statistical risk control.
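To make the setup in the abstract concrete, the following is a minimal sketch of an unnormalized log posterior for a single-hidden-layer network with fixed outer weights and gain $\beta$. Several details are assumptions, since the abstract does not state them: the `tanh` activation, the unit $\ell_{1}$ radius applied to each neuron's weight vector, the squared-error form of the exponent (chosen because it matches a Gaussian likelihood when $\beta = 1/\sigma^{2}$), and the function names `network` and `log_posterior_unnormalized`. It illustrates the kind of target density being discussed and does not implement the log-concave coupling itself.

```python
import numpy as np


def network(X, W, c, activation=np.tanh):
    """Single-hidden-layer network f(x) = sum_k c[k] * activation(w_k . x).

    X: (N, d) inputs, W: (K, d) interior weights, c: (K,) fixed outer weights.
    The tanh activation is an assumption made for this sketch.
    """
    return activation(X @ W.T) @ c


def log_posterior_unnormalized(W, X, y, c, beta, activation=np.tanh):
    """Log of (uniform prior) * exp(-(beta / 2) * sum of squared residuals).

    Assumption: the uniform prior constrains each neuron's weight vector
    (each row of W) to the l1 ball of radius 1. Outside it the density is zero.
    """
    if np.any(np.abs(W).sum(axis=1) > 1.0):
        return -np.inf
    residuals = y - network(X, W, c, activation)
    return -0.5 * beta * np.sum(residuals ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, K = 50, 10, 4
    X = rng.uniform(-1.0, 1.0, size=(N, d))
    y = rng.normal(size=N)
    c = np.full(K, 1.0 / K)   # fixed outer weights (hypothetical choice)
    W = np.zeros((K, d))      # a point inside the l1 ball
    print(log_posterior_unnormalized(W, X, y, c, beta=1.0))
```

A direct MCMC sampler would target this density over the $Kd$ entries of `W`. As the abstract notes, this density is multimodal, which is why the mixture representation with log-concave components is introduced.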
