AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization

Sebastian Bieringer, Gregor Kasieczka, Maximilian F. Steffen, Mathias Trabs

TL;DR

This work introduces a novel algorithm that quantifies epistemic uncertainty via Monte Carlo sampling from a tempered posterior distribution. The algorithm combines the well-established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based optimization using Adam and leverages a prolate proposal distribution to draw efficiently from the posterior.

Abstract

Uncertainty estimation is a key issue when considering the application of deep neural network methods in science and engineering. In this work, we introduce a novel algorithm that quantifies epistemic uncertainty via Monte Carlo sampling from a tempered posterior distribution. It combines the well-established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based optimization using Adam and leverages a prolate proposal distribution to draw efficiently from the posterior. We prove that the constructed chain admits the Gibbs posterior as its invariant distribution and approximates this posterior in total variation distance. Furthermore, we demonstrate the efficiency of the resulting algorithm and the merit of the proposed changes on a state-of-the-art classifier from high-energy particle physics.
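
As a point of reference for the MALA building block named in the abstract, the following is a minimal sketch of plain MALA on a one-dimensional toy problem. It is not the AdamMCMC algorithm: it uses neither Adam momentum nor a prolate proposal. The function name `mala_tempered`, the toy quadratic loss, the flat prior, and all parameter values are assumptions made for illustration only.

```python
import math
import random


def mala_tempered(loss, grad_loss, theta0, lam=1.0, step=0.05, n_steps=5000, seed=0):
    """Plain MALA targeting a density proportional to exp(-lam * loss(theta)).

    Assumption: flat prior, scalar parameter. Returns (samples, acceptance_rate).
    """
    rng = random.Random(seed)

    def log_target(t):
        # Tempered (Gibbs-type) log density up to an additive constant.
        return -lam * loss(t)

    def log_q(to, frm):
        # Log density (up to a constant) of the Langevin proposal
        # N(frm - step * lam * grad_loss(frm), 2 * step) evaluated at `to`.
        mean = frm - step * lam * grad_loss(frm)
        return -((to - mean) ** 2) / (4.0 * step)

    theta = theta0
    samples = []
    accepted = 0
    for _ in range(n_steps):
        # Gradient step on the tempered loss plus Gaussian noise.
        noise = math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        prop = theta - step * lam * grad_loss(theta) + noise
        # Metropolis-Hastings correction with forward and backward proposal densities.
        log_alpha = (log_target(prop) + log_q(theta, prop)) - (
            log_target(theta) + log_q(prop, theta)
        )
        if rng.random() < math.exp(min(0.0, log_alpha)):
            theta = prop
            accepted += 1
        samples.append(theta)
    return samples, accepted / n_steps


# Toy usage: quadratic loss centred at 1.0 (hypothetical example problem).
def toy_loss(t):
    return 0.5 * (t - 1.0) ** 2


def toy_grad(t):
    return t - 1.0


chain, acc_rate = mala_tempered(toy_loss, toy_grad, theta0=0.0, lam=4.0)
print(f"acceptance rate: {acc_rate:.2f}")
```

The accept/reject step is what makes the target density invariant under the chain; replacing the raw gradient step with an optimizer update changes the proposal and therefore the backward density that enters the correction.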

Paper Structure (14 sections, 2 theorems, 54 equations, 5 figures, 1 algorithm)

This paper contains 14 sections, 2 theorems, 54 equations, 5 figures, 1 algorithm.

Key Result

Theorem 1

For arbitrary proposal distributions $q_{1,k}(\tau^{(k)}|\vartheta^{(k)},m^{(k+1)})$ and $q_2$ from eq:q2, the Markov chain $(\vartheta^{(k)},m^{(k)})_{k\ge1}$ admits an invariant distribution $f(\vartheta,m)$. In particular, the marginal distribution of $f(\vartheta,m)$ in $\vartheta$ is the Gibbs posterior distribution $p_{\lambda}(\cdot|\mathcal{D}_{n})$.
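
For orientation, the marginal statement of Theorem 1 can be written out as below. The first line restates the theorem. The second line is the commonly used form of a Gibbs (tempered) posterior and is an assumption here, since the paper's own definition is not reproduced in this overview; the symbols $L_n$ (empirical loss on $\mathcal{D}_n$) and $\pi$ (prior density) are introduced only for illustration.

```latex
\begin{align*}
  \int f(\vartheta, m)\,\mathrm{d}m &= p_{\lambda}(\vartheta \mid \mathcal{D}_{n}), \\
  p_{\lambda}(\vartheta \mid \mathcal{D}_{n}) &\propto \exp\bigl(-\lambda\, L_n(\vartheta)\bigr)\,\pi(\vartheta)
  \qquad \text{(assumed standard form)}.
\end{align*}
```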

Figures (5)

  • Figure 1: Optimization speed of AdamMCMC in comparison to Adam, SGD and sgHMC for the Top-tagging (binary classification) task. We report the mean curves over $5$ independent runs for the highest learning rate that allows stable results from a grid search. Left: Development of the cross-entropy loss on the training set while running the chain/training. To increase readability, we show $\log$-scaling for the first $500$ steps and linear scaling as well as a moving average over $2400$ steps for the remaining $\approx 230$k steps. Right: Accuracy on the test set ($400$k jets) over training epochs ($2400$ steps per epoch). While AdamMCMC ($\sigma = 0.2$) closely resembles the behavior of Adam including overfitting, the samples generated with AdamMCMC ($\sigma = 2.0$) show no signs of overfitting and a similar optimization performance as sgHMC. The error bands show the $\min$-$\max$ envelope of the $5$ runs.
  • Figure 2: Dependence of the mean acceptance rate (upper) and the accuracy of the posterior mean prediction on test data (lower) on the momentum parameters of the proposal distribution $\sigma_\Delta$ (left) and Adam optimizer $\beta_{1} = \beta_{2}$ (right). For low-noise runs, we find a strong dependency of the algorithm efficiency on sufficiently large momentum terms. At higher noise, this dependence is reduced due to the increased width of the proposal distribution and the correspondingly higher probability of the backwards direction.
  • Figure 3: Dependence of the mean acceptance rate (upper) and the accuracy of the posterior mean prediction on test data (lower) on the width of the proposal distribution $\sigma$. Without directional noise (light blue), the algorithm's efficiency is strongly dependent on a correct choice of $\sigma$. Applying a prolate proposal distribution, however, allows the algorithm to approach the deterministic optimization in the limit of low $\sigma$. Noise can then be added to guarantee sufficient parameter space exploration and achieve well-calibrated uncertainties.
  • Figure 4: Scaling of the posterior mean prediction (left) and posterior spread (right) with the proposal distribution width $\sigma$. The posterior spread is calculated as the difference between the 75%- and 25%-quantile of the prediction for $10$ posterior samples. We report individual results for both classes and true and false assignments for a test set of $400$k jets. The center line shows the median of the jets in the corresponding category and the envelopes depict the 75%- and 25%-quantile on the data. We find a slight decrease in classification performance for rising $\sigma$ for a large section of the scanned space, while for low noise values the performance of an Adam optimization is closely reproduced. The uncertainty prediction increases with increasing $\sigma$, starting out at an already significant level for $\sigma = 1$.
  • Figure 5: Comparing the loss development for AdamMCMC algorithms employing a full Metropolis-Hastings correction, as well as a stochastic approximation thereof, from random initialization (left) and the end of the stochastically corrected chain (right). Both chains are run with the same hyperparameters, that is $\sigma=2.0$, $\sigma_\Delta = 3661.6$, $\lambda = 1$, $\beta_{1/2}=0.99$ and a learning rate of $10^{-3}$. We find no significant differences in the dynamics of the chain, although the full correction is slightly more selective.
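
The posterior spread described in the Figure 4 caption (difference between the 75% and 25% quantile of the prediction over posterior samples) can be sketched as follows. The function name `posterior_spread`, the nested-list data layout, and the choice of linear quantile interpolation are assumptions; the caption does not specify a quantile convention.

```python
import statistics


def posterior_spread(preds_per_sample):
    """Per-input posterior mean and interquartile spread.

    preds_per_sample: list of S lists, one per posterior sample, each holding
    one prediction per input. Returns (means, spreads), both of length n_inputs.
    """
    n_inputs = len(preds_per_sample[0])
    means, spreads = [], []
    for j in range(n_inputs):
        # Collect the predictions of all posterior samples for input j.
        column = [sample[j] for sample in preds_per_sample]
        # Quartiles with linear interpolation between order statistics (assumed convention).
        q25, _, q75 = statistics.quantiles(column, n=4, method="inclusive")
        means.append(statistics.fmean(column))
        spreads.append(q75 - q25)
    return means, spreads


# Toy usage: 10 hypothetical posterior samples, 2 inputs.
toy_preds = [[i / 10, 0.5] for i in range(10)]
toy_means, toy_spreads = posterior_spread(toy_preds)
print(toy_means, toy_spreads)
```

An input on which all posterior samples agree gets a spread of zero, while disagreement between samples widens it, which is the sense in which the spread serves as an uncertainty estimate.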

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2