An Inertial Langevin Algorithm

Alexander Falk, Andreas Habring, Christoph Griesbacher, Thomas Pock

TL;DR

This work introduces the Inertial Langevin Algorithm (ILA), a momentum-augmented discretization of Langevin dynamics designed to accelerate sampling from Gibbs distributions $\pi(x) \propto \exp(-U(x))$. By identifying ILA as a discretization of kinetic Langevin dynamics, the authors establish geometric ergodicity in continuous and discrete time and derive a $\mathcal{W}_2$-bias bound that scales as $\mathcal{O}(\sqrt{\Delta t})$, while enabling smaller friction parameters for faster mixing. The paper also elucidates a close link between ILA and over-relaxed Gibbs sampling, and demonstrates substantial empirical acceleration across toy, denoising, and molecular-structure-generation tasks, including high-dimensional and non-smooth settings. The combination of theoretical guarantees and broad numerical validation indicates that momentum-based sampling can significantly improve mixing and practical performance beyond traditional strongly convex regimes. Overall, ILA provides a principled, faster alternative to standard Langevin-based samplers with concrete guarantees and versatile applicability across inverse problems and machine learning tasks.

Abstract

We present a novel method for drawing samples from Gibbs distributions with densities of the form $\pi(x) \propto \exp(-U(x))$. The method accelerates the unadjusted Langevin algorithm by introducing an inertia term similar to Polyak's heavy ball method, together with a corresponding noise rescaling. Interpreting the scheme as a discretization of kinetic Langevin dynamics, we prove ergodicity (in continuous and discrete time) for twice continuously differentiable, strongly convex, and $L$-smooth potentials and bound the bias of the discretization to the target in Wasserstein-2 distance. In particular, the presented proofs allow for smaller friction parameters in the kinetic Langevin diffusion compared to existing literature. Moreover, we show the close ties of the proposed method to the over-relaxed Gibbs sampler. The scheme is tested in an extensive set of numerical experiments covering simple toy examples, total variation image denoising, and the complex task of maximum likelihood learning of an energy-based model for molecular structure generation. The experimental results confirm the acceleration provided by the proposed scheme even beyond the strongly convex and $L$-smooth setting.

Paper Structure

This paper contains 27 sections, 10 theorems, 111 equations, 8 figures, 2 tables, 2 algorithms.

Key Result

Lemma 3.2

Let $\Delta t > 0$ denote the time step, $\varepsilon > 0$ the friction parameter, and $\theta^{-1} > 0$ the particle mass. Setting $\tau = \theta \Delta t^2$ and $\beta = 1 - \varepsilon \Delta t$, the update rule of ILA (algo:ila) is a discretization of the kinetic Langevin dynamics (eq:ul).
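
The parameter mapping in Lemma 3.2 can be illustrated with a minimal sketch. This is not the authors' algorithm: the exact update rule and eq:ul are not reproduced in this summary. The sketch assumes a heavy-ball iteration $x_{k+1} = x_k + \beta (x_k - x_{k-1}) - \tau \nabla U(x_k) + \sqrt{2\tau(1-\beta)}\,\xi_k$, which results from a semi-implicit Euler step applied to one common form of kinetic Langevin dynamics with noise intensity $\sqrt{2\varepsilon\theta}$. The noise scale, the one-dimensional setting, and the names `ila_parameters` and `ila_chain` are assumptions made for illustration.

```python
import math
import random


def ila_parameters(dt, eps, theta):
    """Map (dt, eps, theta) to (tau, beta) as stated in Lemma 3.2."""
    tau = theta * dt ** 2      # step size: tau = theta * dt^2
    beta = 1.0 - eps * dt      # momentum:  beta = 1 - eps * dt
    return tau, beta


def ila_chain(grad_u, x0, dt, eps, theta, n_steps, rng):
    """Run a 1D heavy-ball style Langevin chain (assumed form, see lead-in)."""
    tau, beta = ila_parameters(dt, eps, theta)
    # Assumed noise rescaling: 2 * eps * theta * dt^3 = 2 * tau * (1 - beta).
    sigma = math.sqrt(2.0 * tau * (1.0 - beta))
    x_prev, x = x0, x0         # start with zero velocity
    samples = []
    for _ in range(n_steps):
        noise = sigma * rng.gauss(0.0, 1.0)
        x_next = x + beta * (x - x_prev) - tau * grad_u(x) + noise
        x_prev, x = x, x_next
        samples.append(x)
    return samples


if __name__ == "__main__":
    # Standard Gaussian target: U(x) = x^2 / 2, so grad U(x) = x.
    rng = random.Random(0)
    xs = ila_chain(lambda x: x, 0.0, dt=0.1, eps=1.0, theta=1.0,
                   n_steps=50000, rng=rng)
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    print("sample mean:", mean, "sample variance:", var)
```

Under these assumptions, choosing $\varepsilon \Delta t = 1$ gives $\beta = 0$, and the iteration reduces to the unadjusted Langevin step $x_{k+1} = x_k - \tau \nabla U(x_k) + \sqrt{2\tau}\,\xi_k$.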

Figures (8)

  • Figure 1: Sweep over momentum parameter $\beta$ on a simple 2D Gaussian distribution. Right: For increasing $\beta$ the process performs larger steps, leading to better exploration of the sample space. Left top: This behavior also manifests on a distribution level, where larger momenta lead to faster convergence in Wasserstein-2 distance. Setting $\beta$ too large, however, leads to the oscillatory behavior characteristic of momentum-based optimization methods. Left bottom: With increasing momentum, successive samples within a chain become more widely spaced. Consequently, the autocorrelation decays faster, yielding larger effective sample sizes from a single chain.
  • Figure 1: Left: Convergence speed for sampling from the approximate Laplace distribution from ssec:ex-bivariate-laplace. Right: Same evaluation on the multi-modal potential introduced in ssec:ex-gmm. In both plots, ILA offers the fastest convergence in $\mathcal{W}_2$-distance, followed by the other kinetic Langevin discretizations. All methods using momentum provide a significant speedup compared to the unadjusted Langevin algorithm. Further, the multi-modality of the target causes significantly longer convergence times. Even after 100000 iterations, stationarity has not been reached numerically.
  • Figure 2: Top left: The distribution $\pi_0$ from which the starting samples were drawn. Top right: The smooth approximation of a bivariate Laplace distribution we want to draw samples from (see eq:laplace-potential). Bottom: The histogram approximations of the intermediate distribution $\pi_k$ for $k=35000$, obtained from the compared sampling schemes. Out of all methods, the ILA approximation exhibits the fewest initialization artifacts, which can be attributed to the improved theoretical bounds on step size and friction. All momentum-based sampling schemes significantly outperform the baseline.
  • Figure 3: Top left: The initial distribution $\pi_0$. Top right: The multi-modal target we aim to draw samples from. Bottom: The histogram approximations of the intermediate distribution $\pi_k$ for $k=20000$, obtained from the compared sampling schemes. Comparing the results obtained with and without momentum, it becomes evident that momentum can help to bridge the energy gaps between disjoint modes. However, the momentum-based algorithms also struggle to get the mixture weights correct. This can be seen in the bottom-left component, which is visibly underrepresented. Even after 100000 iterations, there is still initialization bias left in the resulting distribution.
  • Figure 4: Left: Wasserstein convergence for sampling from an ill-conditioned multivariate Gaussian distribution. The behavior observed in the low-dimensional setting carries over to $d=100$-dimensional sampling problems: All kinetic Langevin integrators outperform the unadjusted Langevin algorithm. Right: Evaluation of $\mathbb{E}_{\pi_k}[U(x)]$ over the iterations. After 25000 iterations, the chains produced by several of the compared schemes, including OBA and BAOAB, seemingly have reached stationarity. All methods overestimate the exact value of $\mathbb{E}_{\pi}[U(X)] = d/2$, indicating the presence of discretization bias. Furthermore, the $\pm 1\sigma$ confidence intervals indicate that the concentration of measure is still relatively weak at $d=100$.
  • ...and 3 more figures

Theorems & Definitions (28)

  • Lemma 3.2
  • Proof 1
  • Remark 3.3
  • Lemma 3.4
  • Theorem 3.5
  • Proof 2
  • Theorem 3.6: Contraction of the continuous-time dynamics
  • Proof 3
  • Remark 3.7
  • Theorem 3.8: Ergodicity of the discrete scheme
  • ...and 18 more