On the Natural Gradient of the Evidence Lower Bound

Nihat Ay, Jesse van Oostrum, Adwait Datar

TL;DR

This work analyzes how the Fisher-Rao (natural) gradient behaves for the evidence lower bound (ELBO) in variational inference. Adopting an information-geometric perspective, it relates ELBO optimization on an extended space with hidden units to learning the target distribution on the visible space, and identifies a cylindrical-model condition under which the natural gradient of the ELBO coincides with the natural gradient of the evidence. The core results show that, for cylindrical models, the variational gap has no effect on learning and the ELBO gradient maps to the evidence gradient; for non-cylindrical models this invariance can fail, which motivates geometric criteria under which the equivalence is preserved. The findings give theoretical justification for natural-gradient-based ELBO optimization in the full (unconstrained) and cylindrical settings, and clarify gradient behavior in Bayesian graphical models via tangent-space decompositions.
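
For orientation, the variational gap mentioned above can be made explicit with the standard ELBO decomposition. The notation here is an assumption for illustration and is not taken from the paper: $p(v,h)$ denotes the model on the extended (visible and hidden) space, $p_V$ its visible marginal, $q(h \mid v)$ a variational conditional, and $p^\ast$ the target on the visible space.

$$
\log p_V(v) \;=\; \underbrace{\sum_h q(h \mid v)\,\log\frac{p(v,h)}{q(h \mid v)}}_{\text{ELBO}(v)} \;+\; \underbrace{D\big(q(\cdot \mid v)\,\|\,p(\cdot \mid v)\big)}_{\text{variational gap}\;\ge\;0}
$$

Averaging over the target with $q(v,h) = p^\ast(v)\,q(h \mid v)$ gives the chain rule for the Kullback-Leibler divergence:

$$
D(q \,\|\, p) \;=\; D(p^\ast \,\|\, p_V) \;+\; \sum_v p^\ast(v)\, D\big(q(\cdot \mid v)\,\|\,p(\cdot \mid v)\big)
$$

The first term on the right is the learning objective on the visible space; the second is the averaged gap whose natural gradient the paper analyzes.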

Abstract

This article studies the Fisher-Rao gradient, also referred to as the natural gradient, of the evidence lower bound (ELBO), which plays a central role in generative machine learning. It reveals that the gap between the evidence and its lower bound, the ELBO, has essentially a vanishing natural gradient within unconstrained optimization. As a result, maximization of the ELBO is equivalent to minimization of the Kullback-Leibler divergence from a target distribution, the primary objective function of learning. Building on this insight, we derive a condition under which this equivalence persists even when optimization is constrained to a model. This condition yields a geometric characterization, which we formalize through the notion of a cylindrical model.
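
The unconstrained claim can be checked numerically on a small finite example. The sketch below is not the authors' code; it assumes finite visible and hidden state spaces, the full open simplex as the model, and gradients taken with respect to the model $p$ only. All function names are hypothetical. It uses the closed form of the Fisher-Rao gradient on the simplex, $p \cdot (g - \mathbb{E}_p[g])$ with $g$ the vector of partial derivatives, and compares the marginalized gradient of $D(q \,\|\, \cdot)$ on the joint space with the gradient of $D(p^\ast \,\|\, \cdot)$ on the visible space.

```python
import numpy as np


def fisher_rao_grad(p, euclid_grad):
    """Fisher-Rao gradient on the open simplex: p * (g - E_p[g])."""
    return p * (euclid_grad - np.sum(p * euclid_grad))


def kl_euclid_grad(q, p):
    """Partial derivatives of p -> D(q || p) = sum q * log(q / p)."""
    return -q / p


def projected_joint_grad(p, q):
    """Gradient of D(q || .) at the joint p, pushed forward by marginalizing h."""
    grad_joint = fisher_rao_grad(p, kl_euclid_grad(q, p))
    return grad_joint.sum(axis=1)  # axis 0 = visible, axis 1 = hidden


def visible_grad(p, p_star):
    """Gradient of D(p_star || .) at the visible marginal of p."""
    p_v = p.sum(axis=1)
    return fisher_rao_grad(p_v, kl_euclid_grad(p_star, p_v))


rng = np.random.default_rng(0)
n_v, n_h = 3, 4  # illustrative sizes, chosen arbitrarily


def random_dist(shape):
    """Strictly positive random distribution of the given shape."""
    w = rng.random(shape) + 0.1
    return w / w.sum()


def random_conditional():
    """Strictly positive random q(h | v); each row sums to one."""
    w = rng.random((n_v, n_h)) + 0.1
    return w / w.sum(axis=1, keepdims=True)


p = random_dist((n_v, n_h))   # model on visible x hidden
p_star = random_dist(n_v)     # target on the visible space

# Two different variational conditionals sharing the visible marginal p_star.
q1 = p_star[:, None] * random_conditional()
q2 = p_star[:, None] * random_conditional()

target = visible_grad(p, p_star)
print(np.allclose(projected_joint_grad(p, q1), target))
print(np.allclose(projected_joint_grad(p, q2), target))
```

In this unconstrained setting the joint gradient reduces to $p - q$, so its marginal is $p_V - p^\ast$ whatever conditional $q(h \mid v)$ is chosen. The sketch says nothing about constrained models, where the paper's cylindrical condition becomes relevant.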

Paper Structure (9 sections, 7 theorems, 136 equations, 10 figures)

Key Result

Proposition 1

Let ${\mathcal{M}}$ be a model in $\mathcal{P}$, let ${\mathcal{L}}: {\mathcal{P}}_V \to {\mathbb R}$ be a differentiable objective function, and let $p \in {\mathcal{M}}$ be an admissible point. Then, [...] Furthermore, if one of the two gradients does not vanish, we have [...]

Figures (10)

  • Figure 1: Illustration of a cylindrical model ${\mathcal{M}}$ in terms of a cylinder, the Cartesian product of a circle with a finite interval. The tangent space $T_p {\mathcal{M}}$ equals the sum of its intersections with ${\mathcal{H}}_p$ and ${\mathcal{V}}_p$.
  • Figure 2: Illustration of the gradients considered in Theorem \ref{graddeppr}.
  • Figure 3: Graphical representations of the models $\mathcal{M}^{(a)}$ and $\mathcal{M}^{(b)}$.
  • Figure 4: Illustration of the gradients considered in Theorem \ref{mainthdist}.
  • Figure 5: The blue grid represents the set of independent probability distributions over two random variables corresponding to the cylindrical model $\mathcal{M}^{(a)}_V$, as a subset of $\mathcal{P}_V$. The cylindrical model $\mathcal{M}^{(a)}$ is defined in \ref{eq:cylindrical_model_def} and depicted in Figure \ref{fig:3NodeModels}. The solid black line shows the overlapping trajectories $\sigma_0$, $\sigma_1$ and $\sigma_2$, where $\sigma_0$ satisfies $\dot\sigma_0(t) = -{\rm grad}_{\sigma_0(t)}^{{\mathcal{M}}^{(a)}_V} D(p^\ast \| \cdot)$, $\sigma_1=\pi_V \circ \gamma_1$ is the projection of the negative gradient curve $\gamma_1$ satisfying $\dot\gamma_1(t) = -{\rm grad}^{{\mathcal{M}}^{(a)}}_{\gamma_1(t)} \, D(\mathcal{Q} \| \cdot )$, and $\sigma_2=\pi_V \circ \gamma_2$ is the projection of the negative gradient curve $\gamma_2$ satisfying $\dot\gamma_2(t) = - {\rm grad}^{{\mathcal{M}}^{(a)}}_{\gamma_2(t)} \, D(q \| \cdot)$. Theorems \ref{graddeppr} and \ref{mainthdist} imply that $\sigma_0$, $\sigma_1$ and $\sigma_2$ are identical.
  • ...and 5 more figures

Theorems & Definitions (10)

  • Proposition 1
  • Definition 2: Definition 1 of ay2020locality
  • Theorem 3: Theorem 5 of ay2020locality
  • Lemma 4
  • Theorem 5
  • Theorem 6
  • Corollary 7
  • Remark 8
  • Remark 9
  • Proposition 10