Boosting Latent Diffusion with Flow Matching
Johannes Schusterbauer, Ming Gui, Pingchuan Ma, Nick Stracke, Stefan A. Baumann, Vincent Tao Hu, Björn Ommer
TL;DR
This work addresses the bottleneck of high-resolution image synthesis by marrying a compact diffusion model with Coupling Flow Matching (CFM) in latent space. The approach uses data-dependent couplings to transport low-resolution latent representations to high-resolution latents, which are then decoded to pixel space, enabling $1024^2$ outputs and up to $2048^2$ with substantially reduced compute. By training in latent space and following a straighter ODE trajectory that needs fewer integration steps, the method achieves competitive or superior FID/p-FID while offering faster inference than traditional diffusion baselines, and it extends naturally to super-resolution of degraded images. The results demonstrate that the diversity of a diffusion model can be preserved at low resolution while flow matching efficiently upscales to high resolution, providing a practical, modular path to scalable, high-quality image synthesis.
Abstract
Visual synthesis has recently seen significant leaps in performance, largely due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. However, this comes at the cost of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary characteristics of faster training and inference but less diverse synthesis. We demonstrate that introducing flow matching between a frozen diffusion model and a convolutional decoder enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then effectively provide the necessary visual diversity, while flow matching efficiently enhances resolution and detail by mapping the small latent space to a high-dimensional one. These latents are then projected to high-resolution images by the subsequent convolutional decoder of the latent diffusion approach. Combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders, state-of-the-art high-resolution image synthesis is achieved at $1024^2$ pixels with minimal computational cost. Scaling our method up further, we reach resolutions of up to $2048^2$ pixels. Importantly, our approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easily integrable into the various diffusion model frameworks.
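The latent upscaling step can be illustrated with a minimal NumPy sketch. It assumes a linear interpolant between an upsampled low-resolution latent and its paired high-resolution latent, which is a common flow matching choice; the abstract itself does not specify the interpolant, the upsampling method, or the solver. All function names, tensor shapes, and the oracle velocity field are hypothetical stand-ins for a trained network.

```python
import numpy as np

def upsample_nearest(z_lr, factor):
    """Nearest-neighbour upsampling of a (C, H, W) latent.
    Assumed stand-in for bringing the low-res latent to the target size."""
    return z_lr.repeat(factor, axis=1).repeat(factor, axis=2)

def flow_matching_pair(x0, x1, t):
    """Point x_t on the straight path from x0 to x1 and its target velocity.
    A network v(x_t, t) would be regressed onto the returned velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def euler_integrate(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
z_lr = rng.standard_normal((4, 8, 8))    # hypothetical low-res latent
z_hr = rng.standard_normal((4, 16, 16))  # hypothetical paired high-res latent

# Data-dependent coupling: the path starts at the upsampled low-res latent,
# not at an independent noise sample.
x0 = upsample_nearest(z_lr, 2)
x_t, v_target = flow_matching_pair(x0, z_hr, t=0.3)

# Oracle velocity (assumption): replaces a trained network for illustration.
oracle = lambda x, t: z_hr - x0
x1_hat = euler_integrate(oracle, x0, num_steps=4)
```

With the oracle field the path is exactly straight, so a handful of Euler steps already recovers the high-resolution latent; a learned velocity field would only approximate this, and the number of steps needed would depend on how straight the learned trajectories are.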
