Posterior concentrations of fully-connected Bayesian neural networks with general priors on the weights

Insung Kong, Yongdai Kim

TL;DR

This paper closes a theoretical gap for Bayesian neural networks (BNNs) with i.i.d. Gaussian priors on the weights by developing an approximation theory for non-sparse DNNs with bounded parameters and Leaky-ReLU activations. It proves near-minimax posterior concentration for BNNs when the true function lies in the Hölder class $\mathcal{H}_d^\beta(K)$, achieving the rate $\varepsilon_n = n^{-\beta/(2\beta+d)} \log^\gamma(n)$ with $\gamma>2$, under mild conditions on the prior. The key technical advance is the bounded-parameter approximation theorem (Theorem 1), which lets Gaussian and other general priors attain near-minimax concentration rates. The results cover nonparametric Gaussian and logistic regression and extend to adaptation to unknown smoothness via random-width priors (Theorem 4) and to hierarchical composition structures that can mitigate the curse of dimensionality (Theorem 5). Overall, the work narrows the gap between theory and the priors commonly used in practice, while offering routes to adaptivity and structured-function modeling.
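
To get a feel for the rate $\varepsilon_n = n^{-\beta/(2\beta+d)} \log^\gamma(n)$, the sketch below evaluates it numerically. It is only an illustration: the function names, the default $\gamma = 2.5$, and the sample values of $n$, $\beta$ and $d$ are assumptions chosen for the example, and the rate is stated up to constants and is meaningful only asymptotically (for moderate $n$ the logarithmic factor can dominate).

```python
import math


def polynomial_rate(n: int, beta: float, d: int) -> float:
    """Polynomial part n^{-beta / (2*beta + d)} of the rate."""
    return n ** (-beta / (2.0 * beta + d))


def concentration_rate(n: int, beta: float, d: int, gamma: float = 2.5) -> float:
    """Rate eps_n = n^{-beta/(2*beta+d)} * log(n)^gamma, ignoring constants.

    gamma = 2.5 is an arbitrary illustrative choice satisfying gamma > 2.
    """
    if gamma <= 2:
        raise ValueError("the stated rate requires gamma > 2")
    return polynomial_rate(n, beta, d) * math.log(n) ** gamma


if __name__ == "__main__":
    # Hypothetical settings: smoothness beta = 2, input dimensions 1 and 10.
    for d in (1, 10):
        for n in (10**6, 10**12):
            print(d, n, polynomial_rate(n, 2.0, d), concentration_rate(n, 2.0, d))
```

For fixed $\beta$, the exponent $\beta/(2\beta+d)$ shrinks as $d$ grows, which is the curse of dimensionality that the hierarchical composition result is meant to mitigate.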

Abstract

Bayesian approaches for training deep neural networks (BNNs) have received significant interest and have been effectively utilized in a wide range of applications. There have been several studies on the properties of posterior concentrations of BNNs. However, most of these studies only demonstrate results in BNN models with sparse or heavy-tailed priors. Surprisingly, no theoretical results currently exist for BNNs using Gaussian priors, which are the most commonly used. The lack of theory arises from the absence of approximation results for Deep Neural Networks (DNNs) that are non-sparse and have bounded parameters. In this paper, we present a new approximation theory for non-sparse DNNs with bounded parameters. Additionally, based on the approximation theory, we show that BNNs with non-sparse general priors can achieve near-minimax optimal posterior concentration rates to the true model.

Paper Structure

This paper contains 26 sections, 24 theorems, 98 equations, and 2 figures.

Key Result

Theorem 1

For $\beta \in (0,\infty)$, $K \geq 1$ and $\nu \in [0,1)$, there exist positive constants $C_L, C_r, C_B$ and $c_1$ such that for every $f_0 \in \mathcal{H}_d^\beta(K)$ and every sufficiently large $M \in \mathbb{N}$, there exists $f_{\hat{\bm{\theta}}, \bm{\rho}_{\nu}}^{\operatorname{DNN}} \in \ma

Figures (2)

  • Figure 1: Example of hierarchical composition structure.
  • Figure 2: Example of the network-equivalence lemma (source label `lemma_equal_network`). A DNN with depth $L$ and width $\bm{r} = (2,\dots,2,1)^{\top}$ (above) and its re-scaled DNN (below) obtained from that lemma. We use $\zeta_1 = 2^{-L}$ and $\zeta_2 = \dots = \zeta_{L+1} = 2$ for re-scaling. The two networks produce the same output for the same input.
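
The re-scaling in Figure 2 can be checked numerically. Leaky-ReLU is positively homogeneous ($\rho_\nu(cx) = c\,\rho_\nu(x)$ for $c > 0$), so multiplying the weights of layer $l$ by $\zeta_l > 0$ leaves the output unchanged whenever $\zeta_1 \cdots \zeta_{L+1} = 1$, which holds for $\zeta_1 = 2^{-L}$ and $\zeta_2 = \dots = \zeta_{L+1} = 2$. The sketch below is an illustration under assumptions rather than the paper's construction: the bias rule (bias of layer $l$ multiplied by $\zeta_1 \cdots \zeta_l$), the input dimension 2, $\nu = 0.1$, the random weights, and all function names are hypothetical choices made for the example.

```python
import random


def leaky_relu(x, nu):
    """Leaky-ReLU with negative slope nu in [0, 1)."""
    return x if x > 0 else nu * x


def affine(W, b, h):
    """Compute W h + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]


def forward(weights, biases, x, nu=0.1):
    """Fully-connected network: Leaky-ReLU after every layer except the last."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = [leaky_relu(z, nu) for z in affine(W, b, h)]
    return affine(weights[-1], biases[-1], h)


def rescale(weights, biases, zetas):
    """Scale layer l's weights by zeta_l and its bias by zeta_1 * ... * zeta_l.

    The bias rule is an assumption that keeps every pre-activation a positive
    multiple of the original one.
    """
    new_w, new_b, cum = [], [], 1.0
    for W, b, z in zip(weights, biases, zetas):
        cum *= z
        new_w.append([[z * w for w in row] for row in W])
        new_b.append([cum * bi for bi in b])
    return new_w, new_b


random.seed(0)
L = 4                                 # number of hidden layers (depth)
widths = [2] + [2] * L + [1]          # input dim 2 (assumed), r = (2, ..., 2, 1)
weights = [[[random.gauss(0, 1) for _ in range(widths[i])]
            for _ in range(widths[i + 1])] for i in range(L + 1)]
biases = [[random.gauss(0, 1) for _ in range(widths[i + 1])]
          for i in range(L + 1)]

zetas = [2.0 ** -L] + [2.0] * L       # zeta_1 = 2^{-L}, zeta_2..zeta_{L+1} = 2
zeta_product = 1.0
for z in zetas:
    zeta_product *= z                 # equals 1, so the output scale is preserved

new_w, new_b = rescale(weights, biases, zetas)
x = [0.5, -1.25]
out = forward(weights, biases, x)
out_rescaled = forward(new_w, new_b, x)
print(out, out_rescaled)
```

The first-layer weights shrink by a factor $2^{-L}$ while all later layers double, so the same function is represented by a different parameter vector.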

Theorems & Definitions (31)

  • Definition 1
  • Theorem 1
  • Example 1: Independent prior
  • Example 2: Hierarchical prior
  • Example 3: Multivariate Gaussian prior
  • Theorem 2
  • Remark 1
  • Theorem 3
  • Definition 2: hierarchical composition structure
  • Theorem 4
  • ...and 21 more