Convergence of flow-based generative models via proximal gradient descent in Wasserstein space

Xiuyuan Cheng, Jianfeng Lu, Yixin Tan, Yao Xie

TL;DR

The paper develops a rigorous theory for progressive, flow-based generative models by embedding the JKO (Wasserstein proximal gradient) scheme into a neural network that performs block-wise transport. It proves exponential convergence in $W_2$ and a KL guarantee of $O(\varepsilon^2)$ for data generation using $N \lesssim \log(1/\varepsilon)$ JKO steps, where $\varepsilon$ is the error in the per-step first-order condition. The only assumption on the data distribution is a finite second moment, and the theory extends to data without a density via short-time diffusion. The analysis also covers inversion errors in the reverse pass, giving a mixed KL-$W_2$ bound that shows how controlled inversion errors preserve generation quality. The forward-backward analysis framework can be extended to other first-order Wasserstein optimization schemes. Overall, the work provides non-asymptotic convergence guarantees for progressive flow networks, connecting the variational, optimal transport, and CNF perspectives, and offers a theoretical basis for improving the stability and efficiency of flow-based generative modeling. It also suggests practical regularization and adaptive strategies to narrow the gap between theory and practice when implementing these models.

Abstract

Flow-based generative models enjoy certain advantages in computing the data generation and the likelihood, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both forward (data-to-noise) and reverse (noise-to-data) directions, remains sparse. In this paper, we provide a theoretical guarantee of generating the data distribution by a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderlehrer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of the proximal gradient descent (GD) in Wasserstein space, we prove the Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be $O(\varepsilon^2)$ when using $N \lesssim \log (1/\varepsilon)$ many JKO steps ($N$ Residual Blocks in the flow), where $\varepsilon$ is the error in the per-step first-order condition. The assumption on data density is merely a finite second moment, and the theory extends to data distributions without density and to the case where there are inversion errors in the reverse process, for which we obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which can be of independent interest. The analysis framework can extend to other first-order Wasserstein optimization schemes applied to flow-based generative models.
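
As a rough illustration of the exponential convergence mentioned in the abstract, the sketch below runs exact JKO steps in a toy setting that is not taken from the paper: a one-dimensional Gaussian $p_n = \mathcal{N}(m_n, s_n^2)$ and target $q = \mathcal{N}(0, 1)$. Each step solves $p_{n+1} = \arg\min_p \, \mathrm{KL}(p \| q) + \frac{1}{2\gamma} W_2^2(p_n, p)$, which in this Gaussian case reduces to a closed-form update of $(m, s)$. The step size name `gamma`, the function names, and the starting values are assumptions made for this example; no neural network or first-order-condition error $\varepsilon$ is modeled.

```python
import math


def kl_to_std_normal(m, s):
    """KL( N(m, s^2) || N(0, 1) ) for a 1D Gaussian with s > 0."""
    return 0.5 * (s * s + m * m - 1.0) - math.log(s)


def w2_to_std_normal(m, s):
    """W_2 distance between N(m, s^2) and N(0, 1): sqrt(m^2 + (s - 1)^2)."""
    return math.hypot(m, s - 1.0)


def jko_step(m, s, gamma):
    """One exact JKO step for KL(. || N(0,1)) restricted to 1D Gaussians.

    Minimizes over (m', s') the objective
        KL(N(m', s'^2) || N(0,1)) + ((m' - m)^2 + (s' - s)^2) / (2 * gamma).
    The mean solves m' + (m' - m) / gamma = 0, and the standard deviation
    solves (1 + gamma) * s'^2 - s * s' - gamma = 0 (positive root).
    """
    m_next = m / (1.0 + gamma)
    disc = s * s + 4.0 * gamma * (1.0 + gamma)
    s_next = (s + math.sqrt(disc)) / (2.0 * (1.0 + gamma))
    return m_next, s_next


def run_jko(m0, s0, gamma, n_steps):
    """Return the list of (m_n, s_n) for n = 0, ..., n_steps."""
    traj = [(m0, s0)]
    for _ in range(n_steps):
        m, s = traj[-1]
        traj.append(jko_step(m, s, gamma))
    return traj


if __name__ == "__main__":
    # Illustrative values only (assumed, not from the paper).
    for n, (m, s) in enumerate(run_jko(3.0, 0.2, gamma=0.5, n_steps=10)):
        print(n, kl_to_std_normal(m, s), w2_to_std_normal(m, s))
```

In this toy case the $W_2$ distance to the target shrinks by at least a factor $1/(1+\gamma)$ per step, so driving it below a tolerance takes a number of steps logarithmic in the inverse tolerance, which mirrors the $N \lesssim \log(1/\varepsilon)$ scaling in the abstract.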

Paper Structure (56 sections, 19 theorems, 119 equations, 4 figures)

Key Result

Theorem 2.1

Let $\mu \in \mathcal{P}_2^r$ and $\nu \in \mathcal{P}_2$. Then

Figures (4)

  • Figure 1: The arrows indicate the forward-time flow from data distribution $P$ to normal distribution $q$. The forward and reverse processes \ref{eq:fwd-bwd-process} consist of the sequence of transported densities at discrete time stamps.
  • Figure 2: The monotonicity of a.g.g.-convex $G$ in $\mathcal{P}_2$ proved in Lemma 4.1, as an analog to strong convexity in vector space. We remark that in the usual vector space, the convexity definition does not involve a third vector, since the inner product is uniform, while in probability space the inner product is defined on the tangent space associated with $p$. The dotted line indicates the generalized geodesic between $\rho$ and $\pi$; see the definitions in Section \ref{subsec:prelim-cal-P2}.
  • Figure A.1: Computed values of $\| \nabla_{\mathcal{W}_2} F_{n+1} (p_{n+1}) \|_{p_{n+1}}^2$ from $N=2000$ samples, where $p_{n+1}$ is the pushforward of a Gaussian initial $p_n$ in $\mathbb{R}^2$, $n=0$, by a trained neural network transport $T$. The blue line shows the value as the training progresses, and the dashed line is a base value computed from the analytical solution $p_{n+1}^{\rm true}$ (where the Wasserstein gradient vanishes).
  • Figure A.2: The Wasserstein gradient vector field $\xi$ at samples $x_i^{(1)} = T(x_i)$ (shown by green arrows), where $T$ is the trained neural network transport map, plotted as the training progresses. The yellow dots are samples $x_i^{(1)}$. The length of the arrow is proportional to the magnitude of $\| \xi(x_i^{(1)})\|$.

Theorems & Definitions (45)

  • Theorem 2.1: Brenier Theorem
  • Definition 2.2: Strong subdifferential
  • Definition 2.3: Convexity along generalized geodesics
  • Definition 3.1: Non-degenerate map
  • Lemma 3.2
  • Lemma 3.3
  • Lemma 3.4
  • Remark 3.1: Assumptions on $T_n$
  • Remark 3.2: $T_{n+1}$ and $T_n^{n+1}$
  • Lemma 4.1: Monotonicity of $G$
  • ...and 35 more