Entropy-MCMC: Sampling from Flat Basins with Ease

Bolian Li, Ruqi Zhang

TL;DR

The paper tackles the difficulty of multi-modal posterior distributions in Bayesian neural networks by prioritizing flat basins during sampling. It introduces Entropy-MCMC, which couples the model parameters with an auxiliary guiding variable drawn from a smoothed local-entropy posterior, enabling efficient sampling via a simple joint distribution whose gradients bias toward flat regions. The authors establish convergence guarantees in the strongly convex setting and show faster convergence than prior flatness-aware methods, with extensive experiments demonstrating improved classification, calibration, and OOD detection. This approach offers a practical, drop-in MCMC method that enhances posterior exploration and generalization under realistic compute budgets.

Abstract

Bayesian deep learning relies on the quality of posterior distribution estimation. However, the posterior of deep neural networks is highly multi-modal in nature, with local modes exhibiting varying generalization performance. Given a practical budget, targeting the original posterior can lead to suboptimal performance, as some samples may become trapped in "bad" modes and suffer from overfitting. Leveraging the observation that "good" modes with low generalization error often reside in flat basins of the energy landscape, we propose to bias sampling on the posterior toward these flat regions. Specifically, we introduce an auxiliary guiding variable, the stationary distribution of which resembles a smoothed posterior free from sharp modes, to lead the MCMC sampler to flat basins. By integrating this guiding variable with the model parameter, we create a simple joint distribution that enables efficient sampling with minimal computational overhead. We prove the convergence of our method and further show that it converges faster than several existing flatness-aware methods in the strongly convex setting. Empirical results demonstrate that our method can successfully sample from flat basins of the posterior, and outperforms all compared baselines on multiple benchmarks including classification, calibration, and out-of-distribution detection.
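
The coupling between the model parameter and the guiding variable can be illustrated with a minimal sketch. It assumes the joint energy has the quadratic-coupling form U(theta, theta_a) = f(theta) + (theta - theta_a)^2 / (2 * eta), which this summary does not state explicitly, and it runs plain full-gradient Langevin dynamics on a hypothetical 1-D energy rather than the paper's own algorithm. The names `energy`, `joint_grads`, `sample_joint`, and the hyperparameters `eta` and `step` are illustrative choices, not taken from the source.

```python
import numpy as np

def energy(theta, s_sharp=0.2, s_flat=1.0):
    """Toy 1-D energy f(theta): sharp mode at -2, flat mode at +2 (illustrative only)."""
    a = -0.5 * ((theta + 2.0) / s_sharp) ** 2
    b = -0.5 * ((theta - 2.0) / s_flat) ** 2
    return -np.logaddexp(a, b)

def grad_energy(theta, eps=1e-5):
    """Central finite-difference gradient of the toy energy."""
    return (energy(theta + eps) - energy(theta - eps)) / (2.0 * eps)

def joint_grads(theta, theta_a, eta):
    """Gradients of the assumed joint energy
    U(theta, theta_a) = f(theta) + (theta - theta_a)^2 / (2 * eta)."""
    coupling = (theta - theta_a) / eta
    # theta feels the data term plus the pull toward theta_a;
    # theta_a only feels the pull toward theta.
    return grad_energy(theta) + coupling, -coupling

def sample_joint(n_steps=5000, step=1e-3, eta=0.5, theta0=0.0, seed=0):
    """Unadjusted Langevin dynamics on (theta, theta_a), started between the two modes."""
    rng = np.random.default_rng(seed)
    theta, theta_a = theta0, theta0
    out = np.empty((n_steps, 2))
    for t in range(n_steps):
        g, g_a = joint_grads(theta, theta_a, eta)
        noise = rng.standard_normal(2) * np.sqrt(2.0 * step)
        theta = theta - step * g + noise[0]
        theta_a = theta_a - step * g_a + noise[1]
        out[t] = theta, theta_a
    return out

samples = sample_joint()
theta_samples, theta_a_samples = samples[:, 0], samples[:, 1]
```

In this sketch the only extra cost per step is the coupling term, since the guiding variable needs no gradient of the data term. A real Bayesian neural network would replace the finite-difference gradient with a stochastic minibatch gradient.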

Paper Structure

This paper contains 47 sections, 5 theorems, 37 equations, 13 figures, 10 tables, and 1 algorithm.

Key Result

Lemma 1

Assume $\widetilde{\bm{\theta}}=[\bm{\theta}^T,\bm{\theta}_a^T]^T\in\mathbb{R}^{2d}$ follows the joint distribution $p(\widetilde{\bm{\theta}}|\mathcal{D})$. Then the marginal distributions of $\bm{\theta}$ and $\bm{\theta}_a$ are the original posterior $p(\bm{\theta}|\mathcal{D})$ and the smoothed posterior $p(\bm{\theta}_a|\mathcal{D})$, respectively. Further, the density $p(\widetilde{\bm{\theta}}|\mathcal{D})$ in ...
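
The marginalization claim can be checked with a short worked derivation. It is a sketch under assumptions not stated in this summary: the posterior is written as $p(\bm{\theta}|\mathcal{D})\propto\exp\{-f(\bm{\theta})\}$ for an energy function $f$, and the joint density uses a quadratic coupling with a scalar hyperparameter $\eta>0$. Both $f$ and $\eta$ are symbols introduced here for illustration.

$$
p(\widetilde{\bm{\theta}}|\mathcal{D}) \propto \exp\left\{-f(\bm{\theta}) - \frac{1}{2\eta}\|\bm{\theta}-\bm{\theta}_a\|^2\right\}
$$

Integrating out $\bm{\theta}_a$ leaves only a Gaussian normalizing constant, so the marginal of $\bm{\theta}$ is the original posterior:

$$
\int \exp\left\{-f(\bm{\theta}) - \frac{1}{2\eta}\|\bm{\theta}-\bm{\theta}_a\|^2\right\} d\bm{\theta}_a = (2\pi\eta)^{d/2}\exp\{-f(\bm{\theta})\} \propto p(\bm{\theta}|\mathcal{D})
$$

Integrating out $\bm{\theta}$ instead convolves $\exp\{-f\}$ with a Gaussian kernel of variance $\eta$, which is what smooths away sharp modes:

$$
p(\bm{\theta}_a|\mathcal{D}) \propto \int \exp\left\{-f(\bm{\theta}) - \frac{1}{2\eta}\|\bm{\theta}-\bm{\theta}_a\|^2\right\} d\bm{\theta}
$$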

Figures (13)

  • Figure 1: Illustration of Entropy-MCMC. (a) shows how the guiding variable $\bm{{\theta}}_a$ pulls $\bm{\theta}$ toward flat basins; (b) shows two posterior distributions, where $p(\bm{\theta}_a|\mathcal{D})$ is a smoothed distribution transformed from $p(\bm{\theta}|\mathcal{D})$, and only keeps flat modes. Entropy-MCMC prioritizes flat modes by leveraging the guiding variable $\bm{\theta}_a$ from the smoothed posterior as a form of regularization.
  • Figure 2: Sampling trajectories on a synthetic energy landscape with sharp (lower left) and flat (top right) modes. The initial point is located at the ridge of two modes. EMCMC successfully biases toward the flat mode whereas SGD and SGLD are trapped in the sharp mode.
  • Figure 3: Logistic regression on MNIST in terms of training NLL and testing accuracy (repeated 10 times). EMCMC converges faster than others, which is consistent with our theoretical analysis.
  • Figure 4: Eigenspectrum of Hessian matrices of ResNet18 on CIFAR100. $x$-axis: eigenvalue, $y$-axis: frequency. A nearly all-zero eigenspectrum indicates a local mode that is flat in all directions. EMCMC successfully finds such flat modes with significantly smaller eigenvalues.
  • Figure 5: Parameter space interpolation of ResNet18 on CIFAR100. The neighborhood of local modes is explored from $\bm{\theta}$ along (a)-(b) a random direction in the parameter space and (c) the direction toward $\bm{\theta}_a$. (a) and (b) show that EMCMC has the lowest and flattest NLL and error curves. (c) shows that $\bm{\theta}$ and $\bm{\theta}_a$ converge to the same flat mode while maintaining diversity.
  • ...and 8 more figures
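
The flatness criterion behind the Hessian eigenspectrum in Figure 4 can be illustrated on a small scale. The sketch below is not the paper's ResNet18 experiment: it uses a hypothetical 2-D energy (`toy_energy`) with one sharp and one flat mode and a finite-difference Hessian (`hessian`), both introduced here for illustration. By construction, the curvature is about 1/0.2^2 = 25 at the sharp mode and about 1 at the flat mode.

```python
import numpy as np

def toy_energy(x):
    """Hypothetical 2-D energy: sharp mode near (-2, -2), flat mode near (2, 2)."""
    sharp = -0.5 * np.sum((x + 2.0) ** 2) / 0.2 ** 2
    flat = -0.5 * np.sum((x - 2.0) ** 2) / 1.0 ** 2
    return -np.logaddexp(sharp, flat)

def hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at point x."""
    d = x.size
    steps = np.eye(d) * eps  # row i is a step of size eps along axis i
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = steps[i], steps[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps ** 2)
    return 0.5 * (H + H.T)  # symmetrize to suppress round-off asymmetry

# Smaller eigenvalues mean a flatter basin around the mode.
eig_sharp = np.linalg.eigvalsh(hessian(toy_energy, np.array([-2.0, -2.0])))
eig_flat = np.linalg.eigvalsh(hessian(toy_energy, np.array([2.0, 2.0])))
print("sharp mode eigenvalues:", eig_sharp)
print("flat mode eigenvalues:", eig_flat)
```

For a deep network the full Hessian is too large to form explicitly, so an eigenspectrum like the one in Figure 4 would have to be estimated by other means. The toy version only shows what "smaller eigenvalues" means geometrically.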

Theorems & Definitions (9)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • proof
  • proof
  • proof
  • proof