Inclusive KL Minimization: A Wasserstein-Fisher-Rao Gradient Flow Perspective

Jia-Jie Zhu

TL;DR

A general-purpose approximate inclusive KL inference paradigm can be constructed from the theory of gradient flows derived from PDE analysis, and several existing learning algorithms turn out to be particular realizations of this paradigm.

Abstract

Otto's (2001) Wasserstein gradient flow of the exclusive KL divergence functional provides a powerful and mathematically principled perspective for analyzing learning and inference algorithms. In contrast, algorithms for the inclusive KL inference, i.e., minimizing $\mathrm{KL}(\pi \| \mu)$ with respect to $\mu$ for some target $\pi$, are rarely analyzed using tools from mathematical analysis. This paper shows that a general-purpose approximate inclusive KL inference paradigm can be constructed using the theory of gradient flows derived from PDE analysis. We uncover that several existing learning algorithms can be viewed as particular realizations of the inclusive KL inference paradigm. For example, existing sampling algorithms such as Arbel et al. (2019) and Korba et al. (2021) can be viewed in a unified manner as inclusive-KL inference with approximate gradient estimators. Finally, we provide the theoretical foundation for the Wasserstein-Fisher-Rao gradient flows for minimizing the inclusive KL divergence.

Paper Structure

This paper contains 17 sections, 10 theorems, 77 equations, 1 figure.

Key Result

Theorem 3.1

Suppose that the initial condition satisfies $\pi \ll \mu$, i.e., $\pi$ is absolutely continuous with respect to $\mu$. Then the flow equation (eq:kernelized-gfe-reverseKL) coincides with the Wasserstein gradient flow equation of the MMD (eq:wgf-mmd-pde).
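
To make the object in this statement concrete, the following is a minimal particle sketch of a Wasserstein gradient flow of the MMD, in the spirit of the sampling scheme of Arbel et al. (2019) cited in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the Gaussian kernel, the bandwidth `sigma`, the explicit Euler step `step`, the sample-based target, and all function names are choices made here. Each particle moves along the negative gradient of the empirical witness function $f(z) = \frac{1}{N}\sum_j k(z, x_j) - \frac{1}{M}\sum_j k(z, y_j)$.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """Gram matrix k(x_i, y_j) = exp(-|x_i - y_j|^2 / (2 sigma^2))."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def mmd_squared(x, y, sigma):
    """Biased (V-statistic) estimate of MMD^2 between samples x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

def witness_gradient(x, y, sigma):
    """Gradient of the empirical witness function, evaluated at each particle x_i."""
    def grad_mean_kernel(z, pts):
        diff = z[:, None, :] - pts[None, :, :]           # shape (n, m, d)
        k = gaussian_kernel(z, pts, sigma)[:, :, None]   # shape (n, m, 1)
        # grad_z k(z, p) = -(z - p) / sigma^2 * k(z, p), averaged over pts
        return (-(diff / sigma**2) * k).mean(axis=1)     # shape (n, d)
    return grad_mean_kernel(x, x) - grad_mean_kernel(x, y)

def mmd_flow(x0, y, sigma=1.0, step=0.2, n_steps=300):
    """Explicit Euler discretization of the particle MMD flow (assumed scheme)."""
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * witness_gradient(x, y, sigma)
    return x

# Toy 1-D example: particles start near 0, target samples sit near 2.
rng = np.random.default_rng(0)
target = rng.normal(loc=2.0, scale=0.5, size=(200, 1))
init = rng.normal(loc=0.0, scale=0.5, size=(200, 1))
final = mmd_flow(init, target)

mmd_before = mmd_squared(init, target, 1.0)
mmd_after = mmd_squared(final, target, 1.0)
print(f"MMD^2 before: {mmd_before:.4f}, after: {mmd_after:.4f}")
```

The step size is kept small relative to the kernel bandwidth so that the Euler scheme behaves like gradient descent on the empirical MMD; larger steps or poorly matched bandwidths may stall or oscillate.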

Figures (1)

  • Figure 1: Illustration of the generator $\varphi$ of the exclusive and inclusive KL divergences.
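
For orientation, the two divergences in the figure can be written in one common $\varphi$-divergence convention. This is a sketch under assumptions: the normalization of $\varphi$ and the notation $\mathrm{D}_\varphi$ are chosen here and may differ from the paper's, and $\mu$, $\pi$ are taken to be mutually absolutely continuous probability measures.

```latex
\mathrm{D}_\varphi(\mu \,|\, \pi) = \int \varphi\!\left(\frac{\mathrm{d}\mu}{\mathrm{d}\pi}\right) \mathrm{d}\pi ,
\qquad
\begin{aligned}
\text{exclusive KL:}\quad & \varphi(s) = s\log s - s + 1
  &&\Longrightarrow\quad \mathrm{D}_\varphi(\mu \,|\, \pi) = \mathrm{KL}(\mu \,\|\, \pi), \\
\text{inclusive KL:}\quad & \varphi(s) = s - 1 - \log s
  &&\Longrightarrow\quad \mathrm{D}_\varphi(\mu \,|\, \pi) = \mathrm{KL}(\pi \,\|\, \mu).
\end{aligned}
```

Both generators are convex with $\varphi(1) = 0$; in each case the variable being optimized is $\mu$, and only the generator changes.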

Theorems & Definitions (20)

  • Definition 2.1: Gradient system (Otto, 2001; Mielke, 2023)
  • Theorem 3.1: Flow equation (eq:vanilla-wasserstein-rkl-gfe) has a Wasserstein gradient structure
  • Remark 3.2
  • Remark 3.3: Approximation limit
  • Remark 3.4: Stein gradient flow of inclusive KL
  • Corollary 3.5: Formal equivalence between KSD-WGF and inclusive KL inference
  • Proposition 4.1: FR gradient flow of inclusive KL
  • Theorem 4.2: Exponential Decay of inclusive-KL divergence
  • Proposition 4.3
  • Proposition 4.4: Variational principle for inclusive-KL
  • ...and 10 more