Rapid Bayesian Computation and Estimation for Neural Networks via Log-Concave Coupling

Curtis McDonald, Andrew R. Barron

TL;DR

This paper develops a Bayesian framework for single-hidden-layer neural networks with $Kd$ interior weights and fixed outer weights, aiming for rapid, polynomial-time sampling. It introduces a log-concave coupling that represents the posterior as a mixture over an auxiliary Gaussian variable $\xi$, rendering sampling feasible via log-concave components, under suitable scaling where $Kd\ge A_{3}(\beta N)^{2}$. For a continuous uniform prior, the posterior admits this log-concave mixture form; for a discrete uniform prior, the authors derive risk bounds—namely, an arbitrary-sequence squared-regret bound of order $O(((\log d)/N)^{1/4})$ and a Kullback risk bound of order $O(((\log d)/N)^{1/3})$ under Gaussian data with $\beta=1/\sigma^{2}$. The work connects sampling efficiency with statistical risk control, outlining a pathway to polynomial-time Bayesian training of neural networks while acknowledging gaps between the continuous sampling guarantees and discrete risk guarantees. These results offer a principled framework to balance computation and statistical performance in Bayesian neural networks and suggest future directions to fuse the two prior paradigms for scalable Bayesian training.

Abstract

This paper studies a Bayesian estimation procedure for single-hidden-layer neural networks using $\ell_{1}$ controlled weights. We study the structure of the posterior density and provide a representation that makes it amenable both to rapid sampling via Markov Chain Monte Carlo (MCMC) and to statistical risk guarantees. The neural network has $K$ neurons and internal weight dimension $d$, with the outer weights fixed, giving $Kd$ parameters overall. With $N$ data observations, a gain parameter $\beta$ is used in the posterior density. The posterior is multimodal and not naturally suited to rapid mixing of direct MCMC algorithms. For a continuous uniform prior on the $\ell_{1}$ ball, we show that the posterior density can be written as a mixture density with suitably defined auxiliary random variables, where the mixture components are log-concave. Furthermore, when the number of model parameters $Kd$ is large enough that $Kd \geq C(\beta N)^{2}$, the mixing distribution of the auxiliary random variables is also log-concave. Thus, neuron parameters can be sampled from the posterior by sampling only log-concave densities. We refer to the mixture density as a log-concave coupling. For a discrete uniform prior restricted to a grid, we study the statistical risk (generalization error) of procedures based on the posterior. Using a gain of $\beta = C [(\log d)/N]^{1/4}$, we show that the squared error is of order $O([(\log d)/N]^{1/4})$. Using independent Gaussian data with a variance $\sigma^{2}$ that matches the inverse gain, $\beta = 1/\sigma^{2}$, we show that the expected Kullback divergence is of cube-root order $O([(\log d)/N]^{1/3})$. Future work aims to bridge the sampling ability of the continuous uniform prior with the risk control of the discrete uniform prior, resulting in a polynomial-time Bayesian training algorithm for neural networks with statistical risk control.
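
To make the idea of a log-concave coupling concrete, the following is a minimal one-dimensional sketch and not the paper's construction. It assumes a hypothetical double-well target $p(w) \propto e^{-g(w)}$ with $g(w) = C(w^{2}-1)^{2}$ and a hypothetical Gaussian auxiliary variable $\xi \mid w \sim \mathrm{Normal}(w, 1/\rho)$; the names `C`, `RHO`, `g`, and `neg_log_conditional` are illustrative only. The reverse conditional $p(w \mid \xi)$ then has negative log $g(w) + \rho(\xi - w)^{2}/2$ up to a constant, which is convex once $\rho$ exceeds the most negative curvature of $g$.

```python
import math

C = 1.0    # hypothetical well depth of the toy target (assumption)
RHO = 6.0  # hypothetical coupling precision; chosen larger than 4 * C

def g(w):
    """Negative log of the toy bimodal target density, up to a constant."""
    return C * (w * w - 1.0) ** 2

def neg_log_conditional(w, xi):
    """Negative log of p(w | xi), up to a constant, for xi | w ~ Normal(w, 1/RHO)."""
    return g(w) + 0.5 * RHO * (xi - w) ** 2

def min_second_difference(f, grid):
    """Smallest discrete second derivative of f over an evenly spaced grid."""
    h = grid[1] - grid[0]
    return min((f(w - h) - 2.0 * f(w) + f(w + h)) / (h * h) for w in grid)

# Evenly spaced grid on [-2, 2] covering both wells of the target.
grid = [-2.0 + 0.01 * i for i in range(401)]

# The target itself is not log-concave: its negative log has negative curvature near 0.
target_curvature = min_second_difference(g, grid)

# Each conditional p(w | xi) is log-concave: curvature stays positive for every xi tried.
conditional_curvature = min(
    min_second_difference(lambda w, xi=xi: neg_log_conditional(w, xi), grid)
    for xi in (-1.5, 0.0, 1.5)
)

print("min curvature of target:     ", target_curvature)
print("min curvature of conditional:", conditional_curvature)
```

The first value printed is negative and the second is positive, so the multimodal toy target decomposes into log-concave components indexed by $\xi$. The sketch does not examine the mixing distribution of $\xi$, which the abstract states is log-concave only under the condition $Kd \geq C(\beta N)^{2}$.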

Paper Structure

This paper contains 24 sections, 32 theorems, 262 equations.

Key Result

Theorem 1

Let the neural network have inner weight dimension $d \geq 2$ and $K \geq 2$ neurons, with $N$ data observations $(x_{i}, y_{i})_{i=1}^{N}$. Assume $\beta N \geq 2$. Define the values [...]. Define a value [...]. Let $d$ and $K$ satisfy [...] and [...]. Using a continuous uniform prior on $(S^{d}_{1})^{K}$, for each $n \leq N$ the posterior distribution $p_{n}(w)$ can be written as a mixture distribution with an auxiliary [...]

Theorems & Definitions (70)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Remark 1
  • Theorem 5
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 60 more