Fast and Unified Path Gradient Estimators for Normalizing Flows

Lorenz Vaitl, Ludwig Winkler, Lorenz Richter, Pan Kessel

TL;DR

The paper addresses the efficiency barrier of path gradient estimators for normalizing flows by introducing fast, unified path gradient estimators that work across practical NF architectures. It derives a recursive forward-pass formulation to compute the necessary path-score derivatives, enabling efficient estimation for both coupling and implicitly invertible flows, with linear-time complexity in the coupling case and constant-memory usage. By leveraging a pullback perspective, the forward KL gradient is expressed as a reverse KL-like path gradient, allowing fast, low-variance maximum-likelihood training that can incorporate a target energy function as regularization. Empirical results on Gaussian mixtures and lattice gauge theories demonstrate reduced variance and improved convergence, while runtime analyses show significant speedups over prior methods, broadening the applicability of path-gradient NF training in physics and ML contexts.

Abstract

Recent work shows that path gradient estimators for normalizing flows have lower variance compared to standard estimators for variational inference, resulting in improved training. However, they are often prohibitively more expensive from a computational point of view and cannot be applied to maximum likelihood training in a scalable manner, which severely hinders their widespread adoption. In this work, we overcome these crucial limitations. Specifically, we propose a fast path gradient estimator which improves computational efficiency significantly and works for all normalizing flow architectures of practical relevance. We then show that this estimator can also be applied to maximum likelihood training for which it has a regularizing effect as it can take the form of a given target energy function into account. We empirically establish its superior performance and reduced variance for several natural sciences applications.

Paper Structure (26 sections, 8 theorems, 74 equations, 10 figures, 6 tables, 3 algorithms)

Key Result

Proposition 3.1

Using the diffeomorphism $T_l$, the derivative of the induced probability can be computed recursively as follows
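
The equation itself is missing from this extract. As a hedged sketch, a recursion of this kind follows from the change-of-variables formula. The notation here is an assumption, not taken from the proposition: $x_l = T_l(x_{l-1})$ is the sample after layer $l$, $q_l$ is the density it induces, $J_l = \partial T_l(x_{l-1}) / \partial x_{l-1}$ is the layer Jacobian, and gradients are row vectors. The paper's own indexing may differ.

$$
\begin{aligned}
\log q_l(x_l) &= \log q_{l-1}(x_{l-1}) - \log\left|\det J_l\right|, \\
\frac{\partial \log q_l(x_l)}{\partial x_l}
&= \left( \frac{\partial \log q_{l-1}(x_{l-1})}{\partial x_{l-1}}
 - \frac{\partial \log\left|\det J_l\right|}{\partial x_{l-1}} \right) J_l^{-1}.
\end{aligned}
$$

The second line applies the chain rule with $\partial x_{l-1} / \partial x_l = J_l^{-1}$, so the score of the final density can be accumulated layer by layer during the forward pass, starting from the score of the base density.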

Figures (10)

  • Figure 1: Effective sample size (ESS) over the training iterations for a Gaussian mixture model using the forward and the reverse KL divergence. The intervals denote the standard error over $5$ runs. The best performance is indicated by a dot with subsequent faded average performance in the left and center figure. For the forward KL, we compare multiple hyperparameter settings (see the appendix on computational details) and plot the respective best runs in the central plot. The right plot displays a stereotypical dependency on the data set size for fixed hyperparameters; see the hyperparameter sweep tables for two, four, and six linear layers for more details. We can see that, typically, path gradients perform better than standard maximum likelihood gradients.
  • Figure 2: Training the $U(1)$ flow for Lattice Gauge Theory. Shaded area shows standard error over 4 runs. The Reverse KL Path Gradients reach higher performance and exhibit less erratic behavior.
  • Figure 3: Gradient norm during training of the $\phi^4$-experiments. The norm of the path gradient estimator is closer to zero than the norm of the standard gradient estimator when the target density is well approximated, indicating lower variance.
  • Figure 4: The gradient of the loss function $\text{d}_\theta \mathcal{L}(\theta)$ consists of the path gradient (blue) and a score term (red); the latter vanishes in expectation but has non-vanishing variance. The path gradient framework computes the quantities needed to perform stochastic gradient descent with only the path gradient, eliminating the score term, which tends to increase the variance of the gradient and thus degrades gradient estimation. We propose a fast algorithm for computing the path gradient estimator which evaluates $\partial_{x_\theta} \log q_\theta(x_\theta)$ (green) alongside the forward pass. The term $\partial_{x_\theta} \log p(x_\theta)$ is assumed to be available for a given energy function within the problem formulation. We can then compute the derivative of the log ratio, $\partial_{x_\theta} \log\left( p(x_\theta)/q_\theta(x_\theta)\right)$, which can be interpreted as a scaling of $\partial_\theta x_\theta$, the gradient of the sample with respect to the parameters. Thus the gradient $\text{d}_\theta \mathcal{L}(\theta)$ consists only of the path gradient, eliminating the negative influence of the score term.
  • Figure 5: The $\operatorname{ESS}_p$ of a RealNVP flow with two linear layers in each of its six coupling blocks, trained with Forward KL Gradients (red) and Forward KL Path Gradients (blue). For higher model capacity and larger batch sizes, the Forward KL Path Gradients achieve higher absolute $\operatorname{ESS}_p$, while the Forward KL Gradients collapse with increasing model capacity. Rows increase the width of the linear layers (hidden neurons) and columns increase the batch size used during optimization.
  • ...and 5 more figures
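
The decomposition described in the Figure 4 caption can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's setup: $q_\theta$ is a one-dimensional Gaussian with $\theta = (\mu, \sigma)$, the target $p$ is a standard normal, samples are reparameterized as $x_\theta = \mu + \sigma\epsilon$, and the loss is the reverse KL divergence. The function name `reverse_kl_gradients` is hypothetical.

```python
import numpy as np

def reverse_kl_gradients(mu, sigma, eps):
    """Per-sample gradients of the reverse KL w.r.t. (mu, sigma).

    Assumes q = N(mu, sigma^2), p = N(0, 1), and x = mu + sigma * eps.
    Returns (path, standard), each of shape (n_samples, 2).
    """
    x = mu + sigma * eps
    # Derivatives of the log densities with respect to the sample x.
    dlogq_dx = -(x - mu) / sigma**2
    dlogp_dx = -x
    # Derivatives of the sample with respect to the parameters.
    dx_dmu = np.ones_like(eps)
    dx_dsigma = eps
    # Path gradient: d/dx log(q/p), scaled by dx/dtheta.
    log_ratio_grad = dlogq_dx - dlogp_dx
    path = np.stack([log_ratio_grad * dx_dmu, log_ratio_grad * dx_dsigma], axis=1)
    # Score term: explicit parameter dependence of log q at fixed x.
    score = np.stack(
        [(x - mu) / sigma**2, (x - mu) ** 2 / sigma**3 - 1.0 / sigma], axis=1
    )
    # The standard reparameterized estimator is the sum of both terms.
    standard = path + score
    return path, standard

rng = np.random.default_rng(0)
eps = rng.standard_normal(10_000)

# At the optimum (q equals p) the path gradient is zero for every sample,
# while the standard estimator still fluctuates because of the score term.
path_opt, standard_opt = reverse_kl_gradients(0.0, 1.0, eps)
print("path variance at optimum:    ", path_opt.var(axis=0))
print("standard variance at optimum:", standard_opt.var(axis=0))
```

In this toy case both estimators have the same expectation, the exact gradient $(\mu, \sigma - 1/\sigma)$, but only the path gradient has vanishing variance once $q_\theta$ matches the target, which is the behavior the Figure 3 caption attributes to the path gradient norm.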

Theorems & Definitions (13)

  • Definition 3.1
  • Proposition 3.1: Gradient recursion
  • Proposition 3.1: Recursive gradient computations for coupling flows
  • Corollary 3.1: Recursive gradient computations for affine coupling flows
  • Proposition 4.0: Path gradient for forward KL
  • Proposition B.0: Gradient recursion
  • proof
  • Proposition B.0: Recursive gradient computations for coupling flows
  • proof
  • Corollary B.0: Recursive gradient computations for affine coupling flows
  • ...and 3 more
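
The statements of these results are not included in this extract. As an illustration of what a recursive gradient computation for an affine coupling flow can look like, here is a minimal NumPy sketch derived directly from the change-of-variables formula. It is not the paper's algorithm verbatim: the conditioner (`tanh` of a linear map for the log-scale, a linear map for the shift), the explicit small Jacobians, the half-swap after every layer, and all function names are assumptions made for this example. The tracked score is checked against finite differences of the log density.

```python
import numpy as np

def make_layer(rng, half):
    # Hypothetical conditioner weights: s(xa) = tanh(Ws @ xa), t(xa) = Wt @ xa.
    return {"Ws": 0.5 * rng.standard_normal((half, half)),
            "Wt": 0.5 * rng.standard_normal((half, half))}

def forward_with_score(x0, layers):
    """Push x0 ~ N(0, I) through the flow, tracking d log q / dx per layer."""
    h = x0.size // 2
    x = x0.copy()
    grad = -x0  # score of the standard normal base density
    for layer in layers:
        xa, xb = x[:h], x[h:]
        ga, gb = grad[:h], grad[h:]
        s = np.tanh(layer["Ws"] @ xa)
        t = layer["Wt"] @ xa
        yb = xb * np.exp(s) + t
        S = (1.0 - s**2)[:, None] * layer["Ws"]      # ds/dxa
        B = (xb * np.exp(s))[:, None] * S + layer["Wt"]  # dyb/dxa
        # Row vector (grad - d logdet/dx) times the inverse block Jacobian.
        new_gb = gb * np.exp(-s)
        new_ga = ga - S.sum(axis=0) - new_gb @ B
        # Swap halves so the next layer transforms the other block.
        x = np.concatenate([yb, xa])
        grad = np.concatenate([new_gb, new_ga])
    return x, grad

def log_density(y, layers):
    """log q(y) up to an additive constant, via the inverse pass."""
    h = y.size // 2
    x = y.copy()
    logdet = 0.0
    for layer in reversed(layers):
        yb, xa = x[:h], x[h:]
        s = np.tanh(layer["Ws"] @ xa)
        t = layer["Wt"] @ xa
        xb = (yb - t) * np.exp(-s)
        logdet += s.sum()
        x = np.concatenate([xa, xb])
    return -0.5 * x @ x - logdet

def finite_difference_grad(f, y, step=1e-5):
    # Central differences, used only to check the recursion.
    g = np.zeros_like(y)
    for i in range(y.size):
        e = np.zeros_like(y)
        e[i] = step
        g[i] = (f(y + e) - f(y - e)) / (2.0 * step)
    return g

rng = np.random.default_rng(0)
layers = [make_layer(rng, 2) for _ in range(3)]
y, score = forward_with_score(rng.standard_normal(4), layers)
reference = finite_difference_grad(lambda v: log_density(v, layers), y)
print("max abs deviation from finite differences:", np.abs(score - reference).max())
```

The sketch forms the conditioner Jacobians explicitly, which is only practical in low dimensions; a scalable implementation would presumably rely on vector-Jacobian products from an automatic differentiation framework instead.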