Posterior and variational inference for deep neural networks with heavy-tailed weights

Ismaël Castillo, Paul Egels

TL;DR

The work develops a Bayesian framework for deep neural networks with heavy-tailed priors on fixed, overfitting architectures, demonstrating near-minimax posterior contraction rates that automatically adapt to smoothness and intrinsic dimension. It extends the analysis to mean-field tempered variational approximations, showing they retain similar adaptive rates. The results cover nonparametric regression with compositional structures, geometric data characterized by Minkowski dimension, and anisotropic Besov spaces, and they remain valid when the noise level is unknown or when activation functions extend beyond ReLU. The paper also discusses practical aspects, including algorithmic avenues (MCMC and VB), prior choices, and extensions to standard posteriors via tau-augmentation, highlighting a robust, scalable Bayesian approach to uncertainty quantification in deep learning.

Abstract

We consider deep neural networks in a Bayesian framework with a prior distribution sampling the network weights at random. Following a recent idea of Agapiou and Castillo (2023), who show that heavy-tailed prior distributions achieve automatic adaptation to smoothness, we introduce a simple Bayesian deep learning prior based on heavy-tailed weights and ReLU activation. We show that the corresponding posterior distribution achieves near-optimal minimax contraction rates, simultaneously adaptive to both the intrinsic dimension and the smoothness of the underlying function, in a variety of contexts including nonparametric regression, geometric data, and Besov spaces. While most works so far need a form of model selection built into the prior distribution, a key aspect of our approach is that it does not require sampling hyperparameters to learn the architecture of the network. We also provide variational Bayes counterparts of the results, which show that mean-field variational approximations still benefit from near-optimal theoretical support.
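To make the idea of a prior that samples the network weights at random more concrete, here is a minimal sketch of drawing one ReLU network from a heavy-tailed prior on a fixed architecture. The specific choices are assumptions for illustration only: the Student-t law, the degrees of freedom `df`, the common `scale`, the layer widths, and the function names are not taken from the paper, whose exact prior and scaling are not given in this summary.

```python
import numpy as np

def sample_heavy_tailed_relu_net(widths, df=3.0, scale=1.0, seed=0):
    """Draw weights and biases i.i.d. from a scaled Student-t law.

    ASSUMPTION: Student-t with a single common scale is a stand-in for
    "heavy-tailed"; the paper's actual prior may use another density
    and weight-dependent scalings.
    """
    rng = np.random.default_rng(seed)
    params = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        W = scale * rng.standard_t(df, size=(fan_out, fan_in))
        b = scale * rng.standard_t(df, size=fan_out)
        params.append((W, b))
    return params

def forward(params, x):
    """Evaluate the ReLU network at inputs x of shape (n, d)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W.T + b, 0.0)  # ReLU on hidden layers
    W, b = params[-1]
    return h @ W.T + b  # linear output layer

# Fixed architecture (hypothetical widths): no hyperparameter is sampled.
widths = [2, 16, 16, 1]
params = sample_heavy_tailed_relu_net(widths, df=3.0, scale=0.1, seed=0)
x = np.random.default_rng(1).uniform(size=(5, 2))
y = forward(params, x)
print(y.shape)  # (5, 1)
```

The point of the sketch is structural: the architecture is fixed in advance and only the weights are random, so a prior draw involves no model-selection step.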

Paper Structure

This paper contains 42 sections, 34 theorems, and 208 equations.

Key Result

Theorem 2

Consider data from the nonparametric random design regression model, with $\tau_0 = 1$, $f_0 \in \mathcal{G}(q, \mathbf{d}, \mathbf{t}, \boldsymbol{\beta}, K)$ and arbitrary unknown parameters. Let $\Pi$ be a heavy-tailed DNN prior as described in Section sec:htprior with [...]. Then for any $\alpha \in (0,1)$, for $M>0$ large enough, with $D_\alpha$ the $\alpha$-Rényi divergence, as $n \to \infty$, [...]
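
For reference, the usual definition of the $\alpha$-Rényi divergence between two densities $p$ and $q$ with respect to a dominating measure $\mu$, for $\alpha \in (0,1)$, is given below. This is the standard textbook form; the notation $p$, $q$, $\mu$ is introduced here for illustration, and the paper's own normalization is not shown in this excerpt and may differ.

$$
D_\alpha(p, q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} \, q^{1-\alpha} \, d\mu .
$$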

Theorems & Definitions (43)

  • Definition 1: Heavy-tailed mean-field class
  • Theorem 2
  • Corollary 3
  • Remark 4
  • Theorem 5
  • Definition 6: Minkowski Dimension
  • Theorem 7
  • Definition 8
  • Definition 9: Anisotropic Besov space
  • Theorem 10
  • ...and 33 more