Boosting Latent Diffusion with Flow Matching

Johannes Schusterbauer, Ming Gui, Pingchuan Ma, Nick Stracke, Stefan A. Baumann, Vincent Tao Hu, Björn Ommer

TL;DR

This work addresses the bottleneck of high-resolution image synthesis by pairing a compact diffusion model with Coupling Flow Matching (CFM) in latent space. The approach uses data-dependent couplings to transport low-resolution latent representations to high-resolution latents, which are then decoded to pixel space, enabling $1024^2$ outputs and up to $2048^2$ with substantially reduced compute. By training in latent space and leveraging a fast, straighter ODE-driven flow, the method achieves competitive or superior FID/p-FID while offering faster inference than traditional diffusion baselines, and it extends naturally to degraded-image super-resolution tasks. The results demonstrate that diffusion-model diversity can be preserved at low resolution, while flow matching efficiently upscales to high resolution, providing a practical, modular path to scalable, high-quality image synthesis.

Abstract

Visual synthesis has recently seen significant leaps in performance, largely due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. However, this comes at the cost of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary characteristics of faster training and inference but less diverse synthesis. We demonstrate that introducing flow matching between a frozen diffusion model and a convolutional decoder enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then effectively provide the necessary visual diversity, while flow matching efficiently enhances resolution and detail by mapping the small latent space to a high-dimensional one. These latents are then projected to high-resolution images by the subsequent convolutional decoder of the latent diffusion approach. Combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders, state-of-the-art high-resolution image synthesis is achieved at $1024^2$ pixels with minimal computational cost. By further scaling up our method, we can reach resolutions up to $2048^2$ pixels. Importantly, our approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easily integrable into the various diffusion model frameworks.
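
The flow matching stage described in the abstract maps a starting latent to a high-resolution latent by following a learned vector field, which at inference amounts to solving an ODE over $t \in [0, 1]$. The following is a minimal sketch of that idea using fixed-step Euler integration, not the authors' implementation. The names `euler_integrate` and `toy_velocity`, the toy vectors, and the choice of Euler as the solver are all assumptions made for illustration; a constant velocity stands in for the trained network.

```python
from typing import Callable, List

Vector = List[float]


def euler_integrate(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)  # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy data: a flattened "starting latent" and "target latent".
start = [0.0, 1.0, -2.0]
target = [1.0, 3.0, 0.5]


def toy_velocity(x: Vector, t: float) -> Vector:
    # Stand-in for a trained network: the constant velocity of a straight
    # path from `start` to `target`.
    return [b - a for a, b in zip(start, target)]


result = euler_integrate(toy_velocity, start, num_steps=4)
```

When the path is a straight line, the velocity is constant and Euler integration reaches the endpoint regardless of the step count, which illustrates why straighter flows can get by with fewer function evaluations.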

Paper Structure (28 sections, 9 equations, 19 figures, 11 tables)

Figures (19)

  • Figure 1: Samples synthesized in $1024^2$ px. We elevate Diffusion Models (DMs) and similar architectures to a higher-resolution domain, achieving exceptionally rapid processing speeds. We use Latent Consistency Models (LCM) luo2023lcm, distilled from SD1.5 rombach2022high_latentdiffusion_ldm and SDXL podell2023sdxl, respectively. To achieve the same resolution as LCM-SDXL, we boost LCM-SD1.5 with our Coupling Flow Matching (CFM) model. The LCM-SDXL model fails to produce competitive results within this shortened timeframe, highlighting the effectiveness of our approach in achieving both speed and quality in image synthesis.
  • Figure 2: Approach overview. a) During training we feed both a low- and a high-res image through the pre-trained encoder to obtain a low- and a high-res latent code, respectively. Based on the concatenated low-res latent code and a noisy version of it, the model regresses a vector field within $t \in [0, 1]$. b) During inference we can take any Latent Diffusion Model, generate the low-res latent, and then use our coupling flow matching model to synthesize the higher dimensional latent code. Finally, the pre-trained decoder projects the latent code back to pixel space.
  • Figure 3: Chaining our models enables elevating the image resolution from $128^2$ to $2048^2$ px. The contrast before and after upsampling is presented in the right column, with the original low-resolution image positioned in the top-right corner for reference.
  • Figure 4: Uncurated samples from the Coupling Flow Matching model on top of SD 1.5 rombach2022high_latentdiffusion_ldm using a classifier-free guidance scale of $7.5$. Samples are generated in a $64^2$ latent space and up-sampled with CFM from $64^2$ to $128^2$. The resulting images have a resolution of $1024 \times 1024$ pixels. Best viewed zoomed in.
  • Figure 5: Comparison of 1k image synthesis performance using different architectures. We utilize SD v1.5 as our base model for LDM and adapt its resolution based on jin_training-free_2023. LDM's inference time grows quadratically with higher resolutions, making real-time inference nearly impractical at a $128^2$ resolution latent space. In contrast, the integration of Coupling Flow Matching (CFM) with $50$ function evaluations exhibits consistently faster inference, highlighting its efficiency in high-resolution image synthesis.
  • ...and 14 more figures
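
The training setup in the Figure 2 caption, where the model regresses a vector field for $t \in [0, 1]$ between paired low- and high-resolution latents, can be sketched with a generic flow matching objective. This is a sketch under stated assumptions rather than the paper's exact formulation: the linear interpolant, the mean-squared-error loss, and the names `interpolate`, `target_velocity`, and `flow_matching_loss` are illustrative choices, and the conditioning on the concatenated low-res latent is omitted for brevity.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path between x0 (t=0) and x1 (t=1). Assumed interpolant."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Time derivative of the straight path: constant and equal to x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(
    model: Callable[[Vector, float], Vector],
    x0: Vector,
    x1: Vector,
    t: float,
) -> float:
    """Mean squared error between the predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    predicted = model(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


# Hypothetical coupled pair: x0 plays the role of the starting latent derived
# from the low-res image, x1 the high-res latent of the same image.
x0 = [0.0, 2.0]
x1 = [1.0, 0.0]

# A "perfect" model returns the target velocity; a zero model returns zeros.
perfect_loss = flow_matching_loss(lambda x, t: [1.0, -2.0], x0, x1, t=0.3)
zero_loss = flow_matching_loss(lambda x, t: [0.0, 0.0], x0, x1, t=0.3)
```

In an actual training loop, `t` would be sampled at random for each pair and the loss averaged over a batch; the pairing of `x0` with the `x1` from the same image is what makes the coupling data-dependent.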