Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
Kyungmin Lee, Sihyun Yu, Jinwoo Shin
TL;DR
This work targets slow sampling in diffusion and flow models by introducing Decoupled MeanFlow (DMF), which repurposes pretrained flow models as flow maps without architectural changes. DMF decouples the encoder and decoder timesteps: the encoder is conditioned on the current time $t$ and the decoder on the next time $r$, giving the simple formulation $\mathbf{u}_\theta(\mathbf{x}_t,t,r) = g_\theta(f_\theta(\mathbf{x}_t,t), r)$, where $f_\theta$ denotes the encoder and $g_\theta$ the decoder. Training combines flow matching and mean-flow objectives, with an adaptive Cauchy loss to stabilize learning and a warmup that first trains a flow model before converting it to a DMF model. On ImageNet, DMF achieves 1-step FID of 2.16 (256×256) and 2.12 (512×512), surpassing prior methods by a large margin, and 4-step FID of 1.51 and 1.68, respectively, which nearly matches the performance of flow models while delivering over 100× faster inference. These results highlight the importance of representation quality and encoder-decoder decoupling for efficient few-step generation, and they suggest practical pathways for post-training acceleration of diffusion and flow models in high-resolution image synthesis.
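To make the decoupled formulation concrete, the following is a minimal sketch of $\mathbf{u}_\theta(\mathbf{x}_t,t,r) = g_\theta(f_\theta(\mathbf{x}_t,t), r)$ using toy NumPy blocks. It is not the authors' implementation: the class `ToyDMF`, the helper `make_block`, the residual `tanh` blocks, and the split point `n_decoder` are all hypothetical stand-ins for the blocks of a diffusion transformer. The only point illustrated is that the same stack of blocks is reused, with the earlier blocks receiving $t$ and the final blocks receiving $r$.

```python
import numpy as np

def make_block(rng, dim):
    """Toy time-conditioned residual block (hypothetical stand-in for a transformer block)."""
    W = rng.normal(scale=0.1, size=(dim, dim))
    w_time = rng.normal(scale=0.1, size=(dim,))

    def block(h, time):
        # The conditioning time enters as a simple additive shift.
        return h + np.tanh(h @ W + time * w_time)

    return block

class ToyDMF:
    """Splits one stack of blocks into an encoder (sees t) and a decoder (sees r)."""

    def __init__(self, dim=8, depth=6, n_decoder=2, seed=0):
        rng = np.random.default_rng(seed)
        blocks = [make_block(rng, dim) for _ in range(depth)]
        # Assumption: the last `n_decoder` blocks act as the decoder g.
        self.encoder = blocks[: depth - n_decoder]
        self.decoder = blocks[depth - n_decoder:]

    def encode(self, x_t, t):
        h = x_t
        for block in self.encoder:
            h = block(h, t)  # encoder blocks are conditioned on the current time t
        return h

    def decode(self, h, r):
        for block in self.decoder:
            h = block(h, r)  # decoder blocks are conditioned on the next time r
        return h

    def __call__(self, x_t, t, r):
        # u(x_t, t, r) = g(f(x_t, t), r)
        return self.decode(self.encode(x_t, t), r)

model = ToyDMF()
x_t = np.ones((3, 8))
u = model(x_t, t=1.0, r=0.0)
print(u.shape)
```

In this sketch, calling the model with $r = t$ conditions every block on the same time, which is an ordinary forward pass through the unsplit stack. No parameters are added or removed by the split, which is why the same weights can serve both roles.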
Abstract
Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256×256 and 512×512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100× faster inference.
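The claim that average velocities remove discretization error can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's sampler: it assumes noise at $t=1$ and data at $t=0$, uses the update $\mathbf{x}_r = \mathbf{x}_t + (r - t)\,\mathbf{u}(\mathbf{x}_t, t, r)$ (the definition of an average velocity over $[t, r]$), and replaces the learned network with the linear ODE $d\mathbf{x}/dt = \mathbf{x}$, whose exact average velocity is known in closed form. The function names are hypothetical.

```python
import numpy as np

def sample_with_flow_map(u, x_start, num_steps, t_start=1.0, t_end=0.0):
    """Few-step sampler: jump from t to r using an average velocity u(x, t, r)."""
    times = np.linspace(t_start, t_end, num_steps + 1)
    x = x_start
    for t, r in zip(times[:-1], times[1:]):
        x = x + (r - t) * u(x, t, r)
    return x

def exact_average_velocity(x, t, r):
    # For dx/dt = x the solution is x_r = x_t * exp(r - t), so the average
    # velocity over [t, r] is (x_r - x_t) / (r - t).
    return x * np.expm1(r - t) / (r - t)

def instantaneous_velocity(x, t, r):
    # Plain Euler: ignores r and uses the velocity at the current point only.
    return x

x_1 = np.array([1.0, 2.0])
target = x_1 * np.exp(-1.0)  # exact solution at t = 0

one_step_map = sample_with_flow_map(exact_average_velocity, x_1, num_steps=1)
one_step_euler = sample_with_flow_map(instantaneous_velocity, x_1, num_steps=1)
four_step_euler = sample_with_flow_map(instantaneous_velocity, x_1, num_steps=4)

print("flow map, 1 step error:", np.abs(one_step_map - target).max())
print("Euler,    1 step error:", np.abs(one_step_euler - target).max())
print("Euler,    4 step error:", np.abs(four_step_euler - target).max())
```

With the exact average velocity, a single jump lands on the true solution, whereas Euler integration of the instantaneous velocity needs many steps to approach it. A learned flow map only approximates the average velocity, so in practice a few extra steps can still help, consistent with the 1-step versus 4-step FID gap reported above.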
