Optimising Distributions with Natural Gradient Surrogates

Jonathan So; Richard E. Turner

TL;DR

This work tackles the challenge of computing natural gradients when optimising the parameters of a distribution by reframing the problem in terms of a surrogate distribution $\tilde{q}$ for which natural gradient descent (NGD) is easy to compute. It formalises surrogate natural gradient descent (SNGD), proves equivalence under suitable conditions, and introduces exponential-family surrogates with auxiliary-parameter extensions to broaden applicability. The authors show that several existing NGD methods are instances of SNGD and demonstrate substantial speedups across maximum likelihood estimation (MLE) and variational inference (VI) tasks, including negative-binomial, skew-elliptical, elliptical-copula, and mixture models. The approach is easy to implement with standard autodiff, scalable, and general, offering a practical way to apply natural gradients to a wider class of distributions and problems.

Abstract

Natural gradient methods have been used to optimise the parameters of probability distributions in a variety of settings, often resulting in fast-converging procedures. Unfortunately, for many distributions of interest, computing the natural gradient has a number of challenges. In this work we propose a novel technique for tackling such issues, which involves reframing the optimisation as one with respect to the parameters of a surrogate distribution, for which computing the natural gradient is easy. We give several examples of existing methods that can be interpreted as applying this technique, and propose a new method for applying it to a wide variety of problems. Our method expands the set of distributions that can be efficiently targeted with natural gradients. Furthermore, it is fast, easy to understand, simple to implement using standard autodiff software, and does not require lengthy model-specific derivations. We demonstrate our method on maximum likelihood estimation and variational inference tasks.
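
To make the idea of optimising through a surrogate concrete, here is a minimal sketch of one natural-gradient step for an exponential-family surrogate. The setup is a toy chosen for illustration and is not one of the paper's experiments: the target family is log-normal with parameters $(m, v)$, the surrogate is the Gaussian over $\log x$ with the same $(m, v)$, and the objective is the KL divergence to a fixed log-normal. The step uses the standard exponential-family identity that the natural gradient with respect to the natural parameters equals the ordinary gradient with respect to the mean parameters. All function names are hypothetical, and the gradient is written by hand here where autodiff would normally supply it.

```python
import math

def kl_lognormal(m, v, m_star, v_star):
    """KL(q || p) between two log-normals. It equals the KL between the
    Gaussians over log x that share the same (mean, variance) parameters."""
    return 0.5 * (math.log(v_star / v) + (v + (m - m_star) ** 2) / v_star - 1.0)

def grad_kl(m, v, m_star, v_star):
    """Hand-written gradient of the KL with respect to (m, v)."""
    dm = (m - m_star) / v_star
    dv = 0.5 * (1.0 / v_star - 1.0 / v)
    return dm, dv

def to_natural(m, v):
    """Gaussian natural parameters: eta1 = m / v, eta2 = -1 / (2 v)."""
    return m / v, -0.5 / v

def from_natural(eta1, eta2):
    """Inverse of to_natural."""
    v = -0.5 / eta2
    return eta1 * v, v

def surrogate_ngd_step(m, v, m_star, v_star, step_size):
    """One natural-gradient step in the Gaussian surrogate's natural parameters."""
    dm, dv = grad_kl(m, v, m_star, v_star)
    # Chain rule to the mean parameters (mu1, mu2) = (E[y], E[y^2]),
    # using m = mu1 and v = mu2 - mu1^2.
    d_mu1 = dm - 2.0 * m * dv
    d_mu2 = dv
    # Natural gradient w.r.t. natural parameters = gradient w.r.t. mean parameters.
    eta1, eta2 = to_natural(m, v)
    eta1 -= step_size * d_mu1
    eta2 -= step_size * d_mu2
    return from_natural(eta1, eta2)

# Toy usage: start at (m, v) = (0, 1) and target (m*, v*) = (2, 0.25).
m_new, v_new = surrogate_ngd_step(0.0, 1.0, 2.0, 0.25, step_size=1.0)
m_half, v_half = surrogate_ngd_step(0.0, 1.0, 2.0, 0.25, step_size=0.5)
```

In this toy the gradient with respect to the mean parameters is simply the difference between the current and target natural parameters, so a unit step lands exactly on the target $(2, 0.25)$, while a smaller step moves part of the way and still reduces the KL. Real objectives will not generally have this one-step property; the sketch is only meant to show the mechanics of the update.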

Paper Structure (43 sections, 8 theorems, 57 equations, 20 figures, 3 tables, 3 algorithms)

INTRODUCTION
BACKGROUND
Natural Gradient Descent
Exponential Family Distributions
METHOD
Surrogate Natural Gradient Descent
Choice of Surrogate
Equivalence with Optimisation of $f$
Exponential Family Surrogates
Auxiliary Parameters
RESULTS
Negative Binomial Distribution
Skew-Elliptical Distributions
Elliptical Copulas
Mixture Distributions
...and 28 more sections

Key Result

Proposition 1

$\tilde{f}$ has a local minimum at $\tilde{\theta}^*$ if and only if $f$ has a local minimum at $\theta^* = g(\tilde{\theta}^*)$.

Figures (20)

Figure 1: Negative binomial MLE on the sheep dataset. (left) Training curves. (right) Histogram of the observed counts, overlaid with the PDF of the gamma surrogate of SNGD at convergence.
Figure 2: Training curves for MLE on the miniboone dataset ($n$=32,840, $d$=43) using (left) skew-normal, and (right) skew-$t$ distributions.
Figure 3: Training curves for Bayesian logistic regression VI on the covertype dataset ($n$=500, $d$=53) using (left) skew-normal, and (right) skew-$t$ approximations. MCEF corresponds to the natural gradient VI method of lin2020mcef.
Figure 4: Training curves for MLE on a synthetic dataset ($n$=10,000, $d$=1,000) using (left) skew-normal, and (right) skew-$t$ distributions.
Figure 5: $t$-copula MLE using 5 years of daily stock return data ($n$=1,515, $d$=93). (left) Training curves. (right) Contours of a 2D marginal density from the fitted copula, overlaid with the training data.
...and 15 more figures

Theorems & Definitions (15)

Proposition 1
proof
Proposition 2
proof
Proposition 3
proof
Proposition 4
proof
Proposition 5
proof
...and 5 more
