Shifted Composition III: Local Error Framework for KL Divergence

Jason M. Altschuler, Sinho Chewi

TL;DR

Shifted Composition III develops a discrete-time, KL-focused local-error framework that uses an auxiliary, shifted process to bound KL divergences between two stochastic processes driven by different kernels. By combining local (weak/strong) error analysis with a shifted Girsanov perspective, the paper provides KL guarantees for Langevin-based sampling methods across SLC, WLC, and LSI regimes, and delivers the first KL bounds for randomized midpoint discretization. The framework yields sharp rates, including the optimal $\tilde O(\sqrt{d}/\varepsilon)$ bound in SLC and LSI settings, and extends KL control to settings where Wasserstein-based analyses fail or are suboptimal. These results enable principled analysis and design of sampling algorithms in non-strongly-convex or non-Wasserstein contexts, with practical implications for high-dimensional Bayesian computation and non-Gaussian target distributions.

Abstract

Coupling arguments are a central tool for bounding the deviation between two stochastic processes, but traditionally have been limited to Wasserstein metrics. In this paper, we apply the shifted composition rule, an information-theoretic principle introduced in our earlier work, to adapt coupling arguments to the Kullback-Leibler (KL) divergence. Our framework combines the strengths of two previously disparate approaches: local error analysis and Girsanov's theorem. Akin to the former, it yields tight bounds by incorporating the so-called weak error, and is user-friendly in that it only requires easily verified local assumptions; and akin to the latter, it yields KL divergence guarantees and applies beyond Wasserstein contractivity. We apply this framework to the problem of sampling from a target distribution $\pi$. Here, the two stochastic processes are the Langevin diffusion and an algorithmic discretization thereof. Our framework provides a unified analysis when $\pi$ is assumed to be strongly log-concave (SLC), weakly log-concave (WLC), or to satisfy a log-Sobolev inequality (LSI). Among other results, this yields KL guarantees for the randomized midpoint discretization of the Langevin diffusion. Notably, our result: (1) yields the optimal $\tilde O(\sqrt{d}/\varepsilon)$ rate in the SLC and LSI settings; (2) is the first result to hold beyond the 2-Wasserstein metric in the SLC setting; and (3) is the first result to hold in any metric in the WLC and LSI settings.
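The abstract names the randomized midpoint discretization without spelling it out. The following is a minimal sketch of one common form of that scheme for the overdamped Langevin diffusion $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t$ with $\pi \propto e^{-V}$. The function names, the step size `h`, and the Gaussian example potential are illustrative assumptions, not taken from the paper, which may analyze a different variant.

```python
import numpy as np


def randomized_midpoint_step(x, grad_V, h, rng):
    """One randomized midpoint step for dX = -grad V(X) dt + sqrt(2) dB.

    Assumed form of the scheme (illustrative, not the paper's notation):
      1. draw u ~ Uniform[0, 1];
      2. take an Euler step of length u*h to reach a random intermediate time;
      3. take the full step of length h using the gradient at that point.
    Both steps use increments of the same Brownian path.
    """
    u = rng.uniform()
    # Brownian increment over [0, u*h], then extended to [0, h].
    b_mid = np.sqrt(u * h) * rng.standard_normal(x.shape)
    b_end = b_mid + np.sqrt((1.0 - u) * h) * rng.standard_normal(x.shape)
    # Intermediate point at the random time u*h.
    x_mid = x - u * h * grad_V(x) + np.sqrt(2.0) * b_mid
    # Full step driven by the gradient evaluated at the intermediate point.
    return x - h * grad_V(x_mid) + np.sqrt(2.0) * b_end


def run_chain(grad_V, x0, h, n_steps, seed=0):
    """Iterate the step above n_steps times from x0 and return the last iterate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = randomized_midpoint_step(x, grad_V, h, rng)
    return x


if __name__ == "__main__":
    # Hypothetical example target: standard Gaussian, V(x) = |x|^2 / 2.
    final = run_chain(lambda x: x, x0=np.zeros(3), h=0.1, n_steps=1000)
    print(final)
```

Drawing the evaluation time uniformly at random makes the gradient term an unbiased estimate of the integrated drift over the step, given the path. This is the mechanism usually credited for the scheme's improved weak error relative to a plain Euler step.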

Paper Structure

This paper contains 40 sections, 35 theorems, 191 equations, 1 figure, 1 table.

Key Result

Theorem 1.1

Let $\hat{P}$, $P$ be two Markov kernels over $\mathbb{R}^d$. Assume that for all $x,y\in\mathbb{R}^d$, there are jointly defined random variables $\hat{X} \sim \delta_x \hat{P}$, $X \sim \delta_x P$, $Y \sim \delta_y P$ satisfying the following four conditions: For any probability measures $\mu$ and $\nu$, where $\bar{N} \coloneqq N \wedge \frac{1}{{(1-L)}_+}$, $\bar{\mathcal{E}}_{\rm weak} \c

Figures (1)

  • Figure 1: Illustration of the auxiliary process constructed in §\ref{ssec:kl-simple}. This paper develops a framework to bound the divergence between $\{\hat{\mu}_n = \delta_x \hat{P}^n\}$ (top stochastic process) and $\{\nu_n = \delta_y P^n\}$ (bottom stochastic process). These processes differ both in that they have different initializations $x$ and $y$, and in that they update via different Markov kernels $\hat{P}$ (purple) and $P$ (black), respectively. The auxiliary process $\{\nu_n'\}$ (blue) is constructed to interpolate between one process at initialization ($\nu_0' = \nu_0$) and the other at termination ($\nu_N' = \hat{\mu}_N$). Its update consists of two parts. First, $\nu_n'$ is shifted along the Wasserstein geodesic towards $\hat{\mu}_n$ (vertical dotted line) to produce $\tilde{\nu}_n$; this brings the process closer to the interpolation criterion at termination. Second, $\nu_{n+1}'$ is produced from $\tilde{\nu}_n$ by applying the kernel $P$ (black arrow), except in the last iteration, where $\hat{P}$ is used to ensure the termination criterion.
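
The construction described in the caption can be written schematically as follows. This is a sketch read off the caption only: the interpolation fraction $\eta_n \in [0,1]$ and the notation $\mathrm{Geo}_{\eta}(\cdot,\cdot)$ for the point at fraction $\eta$ along the Wasserstein geodesic are hypothetical symbols introduced here, not the paper's notation.

```latex
\begin{align*}
\nu_0' &= \nu_0, \\
\tilde{\nu}_n &= \mathrm{Geo}_{\eta_n}(\nu_n', \hat{\mu}_n), && 0 \le n \le N-1, \\
\nu_{n+1}' &= \tilde{\nu}_n P, && 0 \le n < N-1, \\
\nu_N' &= \tilde{\nu}_{N-1} \hat{P}.
\end{align*}
```

Under this reading, the termination condition $\nu_N' = \hat{\mu}_N$ holds when the final shift is complete, that is $\tilde{\nu}_{N-1} = \hat{\mu}_{N-1}$, since $\hat{\mu}_N = \hat{\mu}_{N-1} \hat{P}$.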

Theorems & Definitions (79)

  • Theorem 1.1: Standard version of local error framework
  • Theorem 1.2: KL local error framework
  • Definition 2.1
  • Proposition 2.2: Basic properties of the KL divergence
  • Theorem 2.3: Shifted chain rule
  • Lemma 2.4: Convexity principle
  • Theorem 3.1: Simplified framework: KL analysis by coupling
  • Remark 3.2: Interpretation of the bound
  • Remark 3.3: Cross-regularity
  • Lemma 3.5: Distance recursion for the auxiliary process
  • ...and 69 more