Affine calculus for constrained minima of the Kullback-Leibler divergence

Giovanni Pistone

TL;DR

The paper develops a non-parametric, dually affine Information Geometry framework on the open probability simplex, formalized via a statistical bundle S𝔼(Ω) and dual exponential/mixture transports, to study constrained KL minimization. It derives explicit total natural gradients for key divergences, including $D(p||q)$, the cross entropy, entropy, and Jensen–Shannon divergence, and shows how Fisher’s score becomes a moving-chart velocity within this geometry. The analysis is then specialized to product spaces, yielding principled treatments of marginalization, mean-field approximations, Kantorovich and Schrödinger transport, and variational Bayes; concrete gradient-flow forms enable systematic optimization in these settings. The framework unifies Fisherian statistics with transport and statistical physics concepts, offering a principled toolkit for gradient-based learning on non-parametric probability simplices and suggesting avenues for algorithmic development and continuous-space extensions. Key formulas include the total natural gradient grad(D)(q,r) = (−s_q(r), −η_r(q)) and the JS gradient grad(JS(q,r)) = −1/2 s_q((q+r)/2).
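The Jensen–Shannon formula quoted above can be sanity-checked numerically on a finite sample space. The sketch below is not taken from the paper: it assumes the exponential chart is $s_q(r) = \log(r/q) - \mathbb{E}_q[\log(r/q)]$, reads the quoted formula as the gradient in the first argument $q$ with $r$ held fixed, and uses illustrative helper names (`kl`, `js`, `s_chart`, `js_rate_in_q`).

```python
import numpy as np

def kl(q, r):
    """Kullback-Leibler divergence D(q || r) on a finite sample space."""
    return float(np.sum(q * np.log(q / r)))

def js(q, r):
    """Jensen-Shannon divergence with the midpoint mixture m = (q + r) / 2."""
    m = 0.5 * (q + r)
    return 0.5 * kl(q, m) + 0.5 * kl(r, m)

def s_chart(q, r):
    """Assumed exponential chart: log(r/q) centered under q."""
    u = np.log(r / q)
    return u - np.sum(q * u)

def js_rate_in_q(q, r, score_q):
    """Pair the candidate gradient -1/2 s_q((q+r)/2) with a score, under q."""
    m = 0.5 * (q + r)
    return float(np.sum(q * (-0.5 * s_chart(q, m)) * score_q))

def finite_difference_in_q(q, r, score_q, h=1e-6):
    """Central difference of t -> JS(q (1 + t score_q), r) at t = 0."""
    plus = js(q * (1.0 + h * score_q), r)
    minus = js(q * (1.0 - h * score_q), r)
    return (plus - minus) / (2.0 * h)

rng = np.random.default_rng(1)
q = rng.dirichlet(np.ones(4))
r = rng.dirichlet(np.ones(4))
z = rng.normal(size=4)
score_q = z - np.sum(q * z)  # centered under q, so the curve stays normalized

analytic_js = js_rate_in_q(q, r, score_q)
numeric_js = finite_difference_in_q(q, r, score_q)
print(analytic_js, numeric_js)
```

The curve $t \mapsto q\,(1 + t\,u)$ with $\mathbb{E}_q[u] = 0$ stays in the open simplex for small $t$ and has score $u$ at $t = 0$, which is why the finite difference is taken along it.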

Abstract

The non-parametric version of Amari's dually affine Information Geometry provides a practical calculus to perform computations of interest in statistical machine learning. The method uses the notion of a statistical bundle, a mathematical structure that includes both probability densities and random variables to capture the spirit of Fisherian statistics. We focus on computations involving a constrained minimization of the Kullback-Leibler divergence. We show how to obtain neat and principled versions of known computations in applications such as mean-field approximation, adversarial generative models, and variational Bayes.

Paper Structure

This paper contains 14 sections, 5 theorems, 118 equations.

Key Result

Proposition 1

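The total natural gradient of the KL-divergence is $\operatorname{grad} D(q,r) = (-s_q(r), -\eta_r(q))$. That is, more explicitly, for each smooth couple of curves $t \mapsto q(t)$ and $t \mapsto r(t)$, the derivative of $t \mapsto D(q(t) \| r(t))$ is obtained by pairing the first component with the velocity of $q(t)$ and the second component with the velocity of $r(t)$.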
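As a hedged illustration, the formula quoted in the TL;DR, grad(D)(q,r) = (−s_q(r), −η_r(q)), can be checked against a finite difference on a small sample space. The sketch below is not the paper's code. It assumes the exponential chart $s_q(r) = \log(r/q) - \mathbb{E}_q[\log(r/q)]$ and the mixture chart $\eta_r(q) = q/r - 1$, and it takes the velocity of a curve to be its score $\dot q / q$. All function names are illustrative.

```python
import numpy as np

def kl(q, r):
    """Kullback-Leibler divergence D(q || r) on a finite sample space."""
    return float(np.sum(q * np.log(q / r)))

def s_chart(q, r):
    """Assumed exponential chart s_q(r): log(r/q) centered under q."""
    u = np.log(r / q)
    return u - np.sum(q * u)

def eta_chart(r, q):
    """Assumed mixture chart eta_r(q) = q/r - 1, centered under r."""
    return q / r - 1.0

def centered_score(p, z):
    """Turn an arbitrary vector z into a score with zero mean under p."""
    return z - np.sum(p * z)

def kl_rate(q, r, score_q, score_r):
    """Pair (-s_q(r), -eta_r(q)) with the two scores, under q and r."""
    grad_q = -s_chart(q, r)
    grad_r = -eta_chart(r, q)
    return float(np.sum(q * grad_q * score_q) + np.sum(r * grad_r * score_r))

def finite_difference_rate(q, r, score_q, score_r, h=1e-6):
    """Central difference of t -> D(q(t) || r(t)) at t = 0."""
    def curve(p, score, t):
        # p (1 + t score) stays normalized because the score is centered.
        return p * (1.0 + t * score)
    plus = kl(curve(q, score_q, h), curve(r, score_r, h))
    minus = kl(curve(q, score_q, -h), curve(r, score_r, -h))
    return (plus - minus) / (2.0 * h)

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(5))
r = rng.dirichlet(np.ones(5))
score_q = centered_score(q, rng.normal(size=5))
score_r = centered_score(r, rng.normal(size=5))

analytic = kl_rate(q, r, score_q, score_r)
numeric = finite_difference_rate(q, r, score_q, score_r)
print(analytic, numeric)
```

Under these assumptions the two printed values should agree up to finite-difference error. The check follows from differentiating $\sum_x q \log(q/r)$ directly, where the terms $\sum_x \dot q$ and $\sum_x \dot r$ vanish because both curves stay normalized.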

Theorems & Definitions (11)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Proposition 5
  • proof: Proof of eq:divergence-mean-field-grad-1
  • ...and 1 more