Convergence of flow-based generative models via proximal gradient descent in Wasserstein space

Xiuyuan Cheng; Jianfeng Lu; Yixin Tan; Yao Xie

TL;DR

The paper develops a rigorous theory for progressive, flow-based generative models by embedding the JKO (Wasserstein proximal gradient) scheme into a neural network that performs block-wise transport. It proves exponential convergence in $\mathcal{W}_2$ and a KL-based data-generation guarantee of $O(\varepsilon^2)$ with $N \lesssim \log(1/\varepsilon)$ steps, under only a finite second moment assumption, and extends to cases where the data lacks a density via short-time diffusion. The analysis also addresses inversion errors in the reverse pass, establishing a KL-$W_2$ mixed bound and showing how controlled inversion errors preserve generation quality; it further shows a forward-backward framework that can be extended to other first-order Wasserstein optimization schemes. Overall, the work provides non-asymptotic, model-agnostic convergence guarantees for progressive flow networks, bridging variational, optimal transport, and CNF perspectives, and offering theoretical scaffolding for improving stability and efficiency of flow-based generative modeling. It also suggests practical regularization and adaptive strategies to tighten the gap between theory and practice when implementing these models.
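For orientation, the JKO step mentioned above is commonly written as the $W_2$-proximal update below, shown here with the KL divergence to the target $q$ as the objective. The step size $\gamma$, the iterates $p_n$, the error quantity $e_n$, and the contraction factor $r$ are notation assumed for illustration; they are not the paper's constants or exact statements.

$$
p_{n+1} = \arg\min_{\rho \in \mathcal{P}_2} \left\{ \mathrm{KL}(\rho \,\|\, q) + \frac{1}{2\gamma} W_2^2(p_n, \rho) \right\}
$$

The logarithmic step count follows the usual pattern for geometric convergence: if some error quantity satisfies $e_n \le r^n e_0$ with $0 < r < 1$, then

$$
e_N \le \varepsilon \quad \text{whenever} \quad N \ge \frac{\log(e_0/\varepsilon)}{\log(1/r)},
$$

which is the sense in which $N \lesssim \log(1/\varepsilon)$ steps suffice.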

Abstract

Flow-based generative models enjoy certain advantages in computing the data generation and the likelihood, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both forward (data-to-noise) and reverse (noise-to-data) directions, remains sparse. In this paper, we provide a theoretical guarantee of generating the data distribution by a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderlehrer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of the proximal gradient descent (GD) in Wasserstein space, we prove the Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be $O(\varepsilon^2)$ when using $N \lesssim \log (1/\varepsilon)$ many JKO steps ($N$ Residual Blocks in the flow), where $\varepsilon$ is the error in the per-step first-order condition. The assumption on data density is merely a finite second moment, and the theory extends to data distributions without density and to the case of inversion errors in the reverse process, where we obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which can be of independent interest. The analysis framework can extend to other first-order Wasserstein optimization schemes applied to flow-based generative models.
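A minimal sketch of the exponential $W_2$ convergence described above, under strong simplifying assumptions that are not the paper's setting: the target $q$ is the 1D standard normal, the iterates are restricted to 1D Gaussians $N(m, s^2)$, and each JKO step is solved exactly in closed form instead of by a trained Residual Block. The function names and the step size `gamma` are hypothetical.

```python
import math

def jko_step_gaussian(m, s, gamma):
    """One exact JKO step for F(rho) = KL(rho || N(0,1)),
    minimizing over 1D Gaussians N(m', s'^2) only (an assumption of this sketch)."""
    # Mean part: minimize m'^2 / 2 + (m' - m)^2 / (2 * gamma).
    m_new = m / (1.0 + gamma)
    # Std part: stationarity of -log(s') + s'^2 / 2 + (s' - s)^2 / (2 * gamma)
    # gives (1 + gamma) * s'^2 - s * s' - gamma = 0; take the positive root.
    s_new = (s + math.sqrt(s * s + 4.0 * gamma * (1.0 + gamma))) / (2.0 * (1.0 + gamma))
    return m_new, s_new

def w2_to_standard_normal(m, s):
    """W2 distance between N(m, s^2) and N(0, 1) in one dimension."""
    return math.sqrt(m * m + (s - 1.0) ** 2)

def run_jko(m0, s0, gamma, n_steps):
    """Iterate the step and record the W2 distance to the target after each step."""
    m, s = m0, s0
    history = [w2_to_standard_normal(m, s)]
    for _ in range(n_steps):
        m, s = jko_step_gaussian(m, s, gamma)
        history.append(w2_to_standard_normal(m, s))
    return history

if __name__ == "__main__":
    for n, dist in enumerate(run_jko(m0=3.0, s0=0.2, gamma=0.5, n_steps=10)):
        print(n, dist)
```

In this toy case the distance to the target shrinks by at least a factor $1/(1+\gamma)$ per step, so reaching accuracy $\varepsilon$ takes a number of steps proportional to $\log(1/\varepsilon)$. The rate constant here is specific to the Gaussian example.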

Paper Structure (56 sections, 19 theorems, 119 equations, 4 figures)

Introduction
Normalizing Flow models
Normalizing flow.
Progressive flow models.
Additional related works
Score-based diffusion models
SDE in diffusion models
Forward and reverse processes
Flow models related to diffusion and OT
Flow-matching models
Optimal Transport flows
Theoretical guarantees of generative models
Approximation and estimation of GAN
Guarantees of diffusion models
Guarantees of ODE flows
...and 41 more sections

Key Result

Theorem 2.1

Let $\mu \in \mathcal{P}_2^r$ and $\nu \in \mathcal{P}_2$. Then

Figures (4)

Figure 1: The arrows indicate the forward-time flow from data distribution $P$ to normal distribution $q$. The forward and reverse processes \ref{eq:fwd-bwd-process} consist of the sequence of transported densities at discrete time stamps.
Figure 2: The monotonicity of a.g.g.-convex $G$ in $\mathcal{P}_2$ proved in Lemma 4.1, as an analog to strong convexity in vector space. We remark that in the usual vector space, the convexity definition does not involve a third vector, since the inner product is uniform, while in probability space the inner product is defined on the tangent space associated with $p$. The dotted line indicates the generalized geodesic between $\rho$ and $\pi$; see the definitions in Section \ref{subsec:prelim-cal-P2}.
Figure A.1: Computed values of $\| \nabla_{\mathcal{W}_2} F_{n+1} (p_{n+1}) \|_{p_{n+1}}^2$ from $N=2000$ samples, where $p_{n+1}$ is the pushforward of a Gaussian initial $p_n$ in $\mathbb{R}^2$, $n=0$, by a trained neural network transport $T$. The blue line shows the value as the training progresses, and the dashed line is a base value computed from the analytical solution $p_{n+1}^{\rm true}$ (where the Wasserstein gradient vanishes).
Figure A.2: The Wasserstein gradient vector field $\xi$ at samples $x_i^{(1)} = T(x_i)$ (shown by green arrows), where $T$ is the trained neural network transport map, plotted as the training progresses. The yellow dots are samples $x_i^{(1)}$. The length of the arrow is proportional to the magnitude of $\| \xi(x_i^{(1)})\|$.

Theorems & Definitions (45)

Theorem 2.1: Brenier Theorem
Definition 2.2: Strong subdifferential
Definition 2.3: Convexity along generalized geodesics
Definition 3.1: Non-degenerate map
Lemma 3.2
Lemma 3.3
Lemma 3.4
Remark 3.1: Assumptions on $T_n$
Remark 3.2: $T_{n+1}$ and $T_n^{n+1}$
Lemma 4.1: Monotonicity of $G$
...and 35 more
