Convergence Analysis of Flow Matching in Latent Space with Transformers
Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan
TL;DR
This work provides an end-to-end statistical learning theory for flow matching in latent spaces using transformers. By embedding high-dimensional data into a latent space via a pre-trained autoencoder and modeling the velocity field with Lipschitz transformers, the authors establish convergence in Wasserstein-2 distance between the generated and target distributions, accounting for pre-training domain shift and reconstruction error. They prove universal approximation results for Lipschitz transformers, derive generalization and discretization bounds for velocity-field estimation, and quantify the end-to-end error introduced by autoencoder pre-training, yielding precise conditions under which convergence is guaranteed. The results offer theoretical justification for latent-space ODE-based generative models with transformers and guide practical deployment by clarifying the roles of pre-training, discretization, and Lipschitz regularity in achieving reliable sample generation.
Abstract
We present theoretical convergence guarantees for ODE-based generative models, specifically flow matching. We use a pre-trained autoencoder network to map high-dimensional inputs to a low-dimensional latent space, where a transformer network is trained to predict the velocity field of the transformation from a standard normal distribution to the target latent distribution. Our error analysis shows that the distribution of samples generated via the estimated ODE flow converges to the target distribution in the Wasserstein-2 distance under mild and practical assumptions. Furthermore, we show that arbitrary smooth functions can be effectively approximated by Lipschitz-continuous transformer networks, a result that may be of independent interest.
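The sampling step described above, integrating a velocity field from a standard normal to a target latent distribution, can be illustrated with a minimal sketch. It is not the authors' construction: it assumes a linear interpolation path X_t = (1 - t) Z + t X_1 with an isotropic Gaussian latent target, for which the velocity field has a closed form, and it uses that closed form in place of a trained transformer. The autoencoder is omitted entirely, and the names `velocity`, `euler_sample`, and `rms_gap` are hypothetical. The sketch only shows how the Euler discretization error shrinks as the number of steps grows.

```python
import numpy as np


def velocity(x, t, mu, sigma):
    """Closed-form velocity E[X_1 - Z | X_t = x] for the assumed path
    X_t = (1 - t) Z + t X_1, with Z ~ N(0, I) and X_1 ~ N(mu, sigma^2 I)
    independent. This stands in for a learned velocity network."""
    var_t = (1.0 - t) ** 2 + (t * sigma) ** 2          # per-coordinate Var(X_t)
    slope = (t * sigma ** 2 - (1.0 - t)) / var_t       # Cov(X_1 - Z, X_t) / Var(X_t)
    return mu + slope * (x - t * mu)


def euler_sample(z, mu, sigma, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = z.copy()
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * velocity(x, k * h, mu, sigma)
    return x


rng = np.random.default_rng(0)
mu, sigma = np.array([2.0, -1.0]), 0.5                 # illustrative latent target
z = rng.standard_normal((5000, 2))                     # draws from N(0, I)
exact = mu + sigma * z                                 # endpoint of the exact flow


def rms_gap(n_steps):
    """Root-mean-square distance between Euler samples and exact-flow samples.
    Pairing samples through the shared z is a valid coupling, so this value
    upper-bounds the Wasserstein-2 distance between the two empirical laws."""
    x = euler_sample(z, mu, sigma, n_steps)
    return float(np.sqrt(np.mean(np.sum((x - exact) ** 2, axis=1))))


gap_coarse = rms_gap(10)
gap_fine = rms_gap(200)
print(f"10 steps: {gap_coarse:.4f}   200 steps: {gap_fine:.4f}")
```

In this toy setting the only error source is time discretization. The analysis in the paper additionally accounts for velocity-field estimation error and for the autoencoder's reconstruction error and pre-training domain shift, none of which are modeled here.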
