Posterior concentrations of fully-connected Bayesian neural networks with general priors on the weights

Insung Kong; Yongdai Kim

TL;DR

This paper addresses a theoretical gap for Bayesian neural networks (BNNs) with i.i.d. Gaussian priors on the weights by developing an approximation theory for non-sparse DNNs with bounded parameters and Leaky-ReLU activations. It proves near-minimax posterior concentration when the true function lies in the Hölder class $\mathcal{H}_d^\beta(K)$, with rate $\varepsilon_n = n^{-\beta/(2\beta+d)} \log^\gamma(n)$ for $\gamma>2$, under mild conditions on the prior. The key technical advance is the bounded-parameter approximation theorem (Theorem 1), which lets Gaussian and other general priors attain this rate. The result covers nonparametric Gaussian and logistic regression and is extended to smoothness adaptation via random-width priors (Theorem 4) and to hierarchical composition structures that can mitigate the curse of dimensionality (Theorem 5). Overall, the work narrows the gap between theory and the priors used in practice, and offers routes to adaptivity and structured-function modeling.
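To get a feel for the rate $\varepsilon_n = n^{-\beta/(2\beta+d)} \log^\gamma(n)$ quoted above, the following minimal sketch evaluates it numerically. The function name `concentration_rate` and the default `gamma=2.5` are illustrative assumptions, not taken from the paper; the only constraint used from the summary is $\gamma > 2$.

```python
import math


def concentration_rate(n, beta, d, gamma=2.5):
    """Evaluate eps_n = n^(-beta / (2*beta + d)) * log(n)^gamma.

    n     : sample size (must exceed 1 so that log(n) > 0)
    beta  : Hoelder smoothness of the true function
    d     : input dimension
    gamma : exponent of the log factor; the summary requires gamma > 2
            (the default 2.5 is an arbitrary illustrative choice)
    """
    if n <= 1:
        raise ValueError("n must be greater than 1")
    if beta <= 0 or d <= 0:
        raise ValueError("beta and d must be positive")
    if gamma <= 2:
        raise ValueError("gamma must be greater than 2")
    polynomial_part = n ** (-beta / (2.0 * beta + d))
    log_part = math.log(n) ** gamma
    return polynomial_part * log_part


# The log factor dominates for moderate n, so the rate only becomes
# small for very large sample sizes; it is an asymptotic statement.
rate_small_n = concentration_rate(1e20, beta=2.0, d=4)
rate_large_n = concentration_rate(1e30, beta=2.0, d=4)
```

The polynomial factor $n^{-\beta/(2\beta+d)}$ is the classical minimax rate for $\beta$-Hölder functions in dimension $d$; the extra $\log^\gamma(n)$ factor is what makes the result "near" minimax.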

Abstract

Bayesian approaches for training deep neural networks (BNNs) have received significant interest and have been effectively utilized in a wide range of applications. There have been several studies on the properties of posterior concentrations of BNNs. However, most of these studies only demonstrate results in BNN models with sparse or heavy-tailed priors. Surprisingly, no theoretical results currently exist for BNNs using Gaussian priors, which are the most commonly used ones. The lack of theory arises from the absence of approximation results for Deep Neural Networks (DNNs) that are non-sparse and have bounded parameters. In this paper, we present a new approximation theory for non-sparse DNNs with bounded parameters. Additionally, based on the approximation theory, we show that BNNs with non-sparse general priors can achieve near-minimax optimal posterior concentration rates to the true model.

Paper Structure (26 sections, 24 theorems, 98 equations, 2 figures)

Introduction
Preliminaries
Notation
Deep Neural Networks
Approximation results for DNNs
Posterior concentration results for BNNs
Approximation using fully-connected DNN with bounded parameters
Posterior Concentration
Sufficient condition for priors
Result on nonparametric Gaussian regression
Result on nonparametric logistic regression
Avoiding the curse of dimensionality by assuming hierarchical composition structure
Bayesian Neural Networks adaptive to Smoothness
Discussions
Proofs for Section 3
...and 11 more sections

Key Result

Theorem 1

For $\beta \in (0,\infty)$, $K \geq 1$ and $\nu \in [0,1)$, there exist positive constants $C_L, C_r, C_B$ and $c_1$ such that for every $f_0 \in \mathcal{H}_d^\beta(K)$ and every sufficiently large $M \in \mathbb{N}$, there exists $f_{\hat{\bm{\theta}}, \bm{\rho}_{\nu}}^{\operatorname{DNN}} \in \dots$

Figures (2)

Figure 1: Example of hierarchical composition structure.
Figure 2: Illustration of the network re-scaling lemma. A DNN with depth $L$ and width $\bm{r} = (2,\dots,2,1)^{\top}$ (above) and its re-scaled DNN (below). We use $\zeta_1 = 2^{-L}$ and $\zeta_2 = \dots = \zeta_{L+1} = 2$ for re-scaling. The two networks produce the same output for the same input.
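
The re-scaling in Figure 2 relies on the positive homogeneity of Leaky-ReLU: $\rho_\nu(cx) = c\,\rho_\nu(x)$ for $c > 0$. The sketch below checks this numerically with NumPy. It is an illustration under stated assumptions, not the paper's lemma: the helper names (`leaky_relu`, `forward`, `rescale`), the slope `nu=0.1`, the input dimension `d = 3`, and the convention that the bias of layer $l$ is scaled by the cumulative product $\zeta_1 \cdots \zeta_l$ are all choices made here so that the identity holds when $\prod_l \zeta_l = 1$.

```python
import numpy as np


def leaky_relu(x, nu=0.1):
    """Leaky-ReLU with negative slope nu in [0, 1) (nu = 0 gives ReLU)."""
    return np.where(x > 0, x, nu * x)


def forward(weights, biases, x, nu=0.1):
    """Fully-connected network: Leaky-ReLU on hidden layers, linear output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = leaky_relu(W @ h + b, nu)
    return weights[-1] @ h + biases[-1]


def rescale(weights, biases, zetas):
    """Scale layer l's weights by zeta_l and its bias by zeta_1 * ... * zeta_l.

    Assumed convention (not quoted from the paper): with all zeta_l > 0 and
    prod(zetas) == 1, the re-scaled network computes the same function.
    """
    cumulative = np.cumprod(zetas)
    new_weights = [z * W for z, W in zip(zetas, weights)]
    new_biases = [c * b for c, b in zip(cumulative, biases)]
    return new_weights, new_biases


rng = np.random.default_rng(0)
L, d = 4, 3                      # depth L as in Figure 2; d is arbitrary
dims = [d] + [2] * L + [1]       # widths (2, ..., 2, 1) as in Figure 2
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(L + 1)]
biases = [rng.normal(size=dims[i + 1]) for i in range(L + 1)]

# zeta_1 = 2^{-L}, zeta_2 = ... = zeta_{L+1} = 2, so the product equals 1.
zetas = np.array([2.0 ** (-L)] + [2.0] * L)

x = rng.normal(size=d)
out_original = forward(weights, biases, x)
out_rescaled = forward(*rescale(weights, biases, zetas), x)
```

Because each hidden activation of the re-scaled network is a positive multiple of the original one, the multiples accumulate layer by layer and cancel at the output when the $\zeta_l$ multiply to one. This kind of re-scaling is what allows large weights in one layer to be traded for moderate weights spread across layers.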

Theorems & Definitions (31)

Definition 1
Theorem 1
Example 1: Independent prior
Example 2: Hierarchical prior
Example 3: Multivariate Gaussian prior
Theorem 2
Remark 1
Theorem 3
Definition 2: hierarchical composition structure
Theorem 4
...and 21 more
