Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling

Kyungmin Lee, Sihyun Yu, Jinwoo Shin

TL;DR

This work targets the bottleneck of slow sampling in diffusion and flow models by introducing Decoupled MeanFlow (DMF), which repurposes pretrained flow models as flow maps without architectural changes. DMF decouples encoder and decoder timesteps so the encoder uses the current time $t$ while the decoder uses the next time $r$, enabling a simple formulation $\mathbf{u}_\theta(\mathbf{x}_t,t,r) = g_\theta(f_\theta(\mathbf{x}_t,t), r)$. Training combines flow matching and mean-flow objectives, with an adaptive Cauchy loss to stabilize learning and a warmup that first trains a flow model before converting it to a DMF model. Empirically, DMF achieves 1-step FID of 2.16 (256×256) and 2.12 (512×512), and 4-step FID of 1.51 and 1.68, respectively, significantly outperforming prior methods and delivering over 100× faster inference than comparable flow-model baselines. These results highlight the importance of representation quality and encoder–decoder decoupling for efficient few-step generation, and they suggest practical pathways for post-training acceleration of diffusion and flow models in high-resolution image synthesis.
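
To make the formulation concrete, the link between an average velocity and the flow map it induces can be written out. This is a sketch under a common convention, not a quotation from the paper: $\mathbf{v}$ denotes the instantaneous velocity of the underlying flow (a symbol assumed here), and the sign of $r - t$ follows whichever direction sampling runs in.

$$
\mathbf{u}(\mathbf{x}_t, t, r) = \frac{1}{r - t}\int_t^r \mathbf{v}(\mathbf{x}_\tau, \tau)\,d\tau,
\qquad
\mathbf{x}_r = \mathbf{x}_t + (r - t)\,\mathbf{u}_\theta(\mathbf{x}_t, t, r).
$$

Under this reading, letting $r \to t$ collapses the average to the instantaneous velocity $\mathbf{v}(\mathbf{x}_t, t)$, which is consistent with treating a pretrained flow model as the starting point for a flow map.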

Abstract

Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.
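
The decoupling described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the blocks are toy affine functions standing in for time-conditioned transformer blocks, and the names `make_toy_block`, `decoupled_velocity`, `sample`, and `encoder_depth` are hypothetical. The time grid passed to `sample` is likewise an assumption; the update only relies on the predicted quantity being an average velocity between `t` and `r`.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A block maps (hidden state, timestep) to a new hidden state.
Block = Callable[[Vector, float], Vector]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for a time-conditioned transformer block."""
    def block(h: Vector, time: float) -> Vector:
        return [scale * value + time for value in h]
    return block


def decoupled_velocity(blocks: Sequence[Block], x_t: Vector, t: float,
                       r: float, encoder_depth: int) -> Vector:
    """Compute u(x_t, t, r) = g(f(x_t, t), r) with one unchanged block stack."""
    h = list(x_t)
    for block in blocks[:encoder_depth]:
        h = block(h, t)  # encoder f: conditioned on the current time t
    for block in blocks[encoder_depth:]:
        h = block(h, r)  # decoder g: conditioned on the next time r
    return h


def sample(blocks: Sequence[Block], x: Vector, timesteps: Sequence[float],
           encoder_depth: int) -> Vector:
    """Few-step sampling: one network call per consecutive (t, r) pair."""
    for t, r in zip(timesteps[:-1], timesteps[1:]):
        u = decoupled_velocity(blocks, x, t, r, encoder_depth)
        # Displacement = average velocity times the signed time gap.
        x = [x_i + (r - t) * u_i for x_i, u_i in zip(x, u)]
    return x


if __name__ == "__main__":
    toy_blocks = [make_toy_block(0.5), make_toy_block(2.0), make_toy_block(1.0)]
    print(sample(toy_blocks, [1.0, -2.0], [1.0, 0.0], encoder_depth=2))
```

When `r == t`, every block receives the same timestep, so the stack behaves like the original flow model. This is one way to see why no architectural change is needed: only the timestep fed to the final blocks differs.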

Paper Structure

This paper contains 23 sections, 23 equations, 14 figures, 9 tables, 1 algorithm.

Figures (14)

  • Figure 1: Accelerating diffusion transformer via Decoupled MeanFlow. (Left) Our model, Decoupled MeanFlow (DMF), converts a flow model into a flow map by decoding the intermediate representation with the next timestep $r$, while preserving the original architecture. (Right) Fine-tuning DMF-XL/2 to predict average velocity (geng2025mean) significantly accelerates sampling relative to the flow model (SiT-XL+REPA; yu2025representation) while maintaining performance.
  • Figure 2: Qualitative examples. Selected samples from our DMF-XL/2+ models trained on ImageNet 512$\times$512 (top row) and ImageNet 256$\times$256 (bottom row) using NFE = 1 (left), 2 (middle), 4 (right).
  • Figure 3: Pretrained flow model as a flow map. Comparison between the pretrained flow model (SiT-XL/2+REPA; yu2025representation) and the converted flow map (see Fig. 1); FID-50K is reported. (a) The converted DMF without fine-tuning (DMF w/o FT) outperforms SiT-XL+REPA when a proper decoder depth is chosen. (b) With depth fixed to 22 and the number of denoising steps varied, DMF w/o FT consistently outperforms the pretrained SiT-XL/2+REPA. (c) With the encoder frozen and the decoder fine-tuned using a flow map loss with guidance, the decoder-tuned DMF (DMF Decoder FT) achieves a substantial gain in sampling efficiency over SiT-XL/2+REPA with CFG.
  • Figure 4: Effect of Flow Matching warmup. We plot 1-step FID against total training compute for DMF-L/2 trained from scratch and for DMF-L/2 fine-tuned from SiT-L/2 400K and 800K pretrained models. The fine-tuned models quickly recover 1-step performance, and DMF-L/2 fine-tuned from the 800K SiT-L/2 model outperforms the others while using fewer total training FLOPs.
  • Figure 5: Euler vs. Restart samplers. We compare Euler and Restart samplers with DMF-XL/2+ trained on ImageNet 256$\times$256. FID-50K, Inception score (IS), and Fréchet distance DINOv2 ($\text{FD}_{\text{DINOv2}}$) are reported. We plot results for SiT-XL/2+REPA with CFG.
  • ...and 9 more figures