An Equivalence between Bayesian Priors and Penalties in Variational Inference

Pierre Wolinski, Guillaume Charpiat, Yann Ollivier

TL;DR

The paper addresses the common practice of adding ad hoc penalties in variational inference (VI) by establishing a rigorous link between such penalties and Bayesian priors. Its main theorem shows that, under suitable conditions, a penalty can be written as a KL divergence between the approximate posterior and a prior; it gives an explicit Fourier-based construction of the prior density and a closed-form formula for the prior when the penalty is admissible. A series of examples shows that L2 penalties induce Gaussian priors, that cosine penalties yield oscillatory priors, and that L1 penalties are incompatible with smooth VI posteriors; the paper also shows that penalties that decompose over parameter blocks yield product-form priors. Beyond theory, the authors develop a practical heuristic for choosing the penalty strength in neural networks by matching the implied prior variance to reasonable initialization schemes, and they explore the relation to the cold posterior effect. Overall, the work provides a principled framework for designing penalties within VI that preserve a Bayesian interpretation, with concrete guidance for hyperparameter tuning and prior specification in deep learning.

Abstract

In machine learning, it is common to optimize the parameters of a probabilistic model, modulated by an ad hoc regularization term that penalizes some values of the parameters. Regularization terms appear naturally in Variational Inference, a tractable way to approximate Bayesian posteriors: the loss to optimize contains a Kullback--Leibler divergence term between the approximate posterior and a Bayesian prior. We fully characterize the regularizers that can arise according to this procedure, and provide a systematic way to compute the prior corresponding to a given penalty. Such a characterization can be used to discover constraints over the penalty function, so that the overall procedure remains Bayesian.
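As a small illustration of how a KL term acts as a regularizer, the sketch below uses the standard closed-form KL divergence between two univariate Gaussians. It assumes a Gaussian approximate posterior N(mu, sigma^2) and a zero-mean Gaussian prior N(0, prior_sigma^2); this is only the simplest special case, not the paper's general construction, and the names `kl_gaussian` and `l2_penalty` are hypothetical helpers introduced here.

```python
import math

def kl_gaussian(mu, sigma, prior_sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, prior_sigma^2) )."""
    return (math.log(prior_sigma / sigma)
            + (sigma ** 2 + mu ** 2) / (2.0 * prior_sigma ** 2)
            - 0.5)

def l2_penalty(mu, prior_sigma):
    """L2 penalty on the posterior mean, with strength 1 / (2 * prior_sigma^2)."""
    return mu ** 2 / (2.0 * prior_sigma ** 2)

# Illustrative values only (assumptions, not taken from the paper).
sigma, prior_sigma = 0.5, 2.0

# With sigma held fixed, the part of the KL term that depends on mu
# is exactly the L2 penalty.
baseline = kl_gaussian(0.0, sigma, prior_sigma)
for mu in (0.0, 1.0, -3.0):
    mu_dependent_part = kl_gaussian(mu, sigma, prior_sigma) - baseline
    print(mu, mu_dependent_part, l2_penalty(mu, prior_sigma))
```

In this special case the penalty strength and the prior variance determine each other, which is the kind of correspondence the paper characterizes for general penalties.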

Paper Structure

This paper contains 46 sections, 12 theorems, 90 equations, 1 figure.

Key Result

Theorem 1

We assume that the variational family $(\beta_{\boldsymbol{\mu}, \boldsymbol{\nu}})_{\boldsymbol{\mu}, \boldsymbol{\nu}}$ fulfills Assumption 1. Let $r_{\boldsymbol{\nu}}(\boldsymbol{\mu})$ be a penalty over $\beta_{\boldsymbol{\mu}, \boldsymbol{\nu}}$. We assume that, for all $\boldsymbol{\nu}$, ... Then, there exists a unique function $A_{\boldsymbol{\nu}} \in \mathcal{S}'(\mathbb{R}^N)$ such that ...

Figures (1)

  • Figure 1: Test NLL obtained at the epoch where the validation NLL is optimal, for various penalty factors $\bar{\lambda}$. The quality of our heuristics can be estimated by measuring the closeness of the minimum of the blue curve to the solid orange line.

Theorems & Definitions (28)

  • Example 1
  • Example 2
  • Theorem 1
  • proof
  • Remark 1
  • Lemma 1
  • Lemma 2
  • Corollary 1
  • Proposition 1
  • proof
  • ...and 18 more