Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun

TL;DR

Flash-DMD introduces a two-stage framework that drastically speeds up few-step diffusion generation while preserving high fidelity. The first stage employs a timestep-aware distillation that decouples distribution matching and perceptual realism, aided by a SAM-based Pixel-GAN to curb mode-seeking and a stabilized score estimator, achieving as little as $2.1\%$ of the training cost of prior methods. The second stage integrates latent reinforcement learning directly into the distillation loop, using a Latent Reward Model to guide refinement and mitigate reward hacking, yielding superior fine-grained detail and human alignment. Across score-based SDXL and flow-matching SD3-Medium, Flash-DMD attains state-of-the-art quality in few-step regimes with robust stability and generalization, making efficient, high-fidelity diffusion distillation more accessible for practical deployment.

Abstract

Diffusion models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique for accelerating generation, but it often requires extensive training and degrades image quality. Furthermore, fine-tuning distilled models for specific objectives, such as aesthetic appeal or user preference, with Reinforcement Learning (RL) is notoriously unstable and prone to reward hacking. In this work, we introduce Flash-DMD, a novel framework that combines fast-converging distillation with joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost while enhancing realism, outperforming DMD2 with only $2.1\%$ of its training cost. Second, we introduce a joint training scheme in which the model is fine-tuned with an RL objective while timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow-matching models show that Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods on visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code is coming soon.
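The regularizing effect described in the abstract can be illustrated with a toy one-parameter problem. This is only a sketch under loud assumptions, not the paper's objective: the "reward" is taken to be linear in the parameter (so maximizing it alone never stops, a stand-in for reward hacking), the "distillation loss" is a squared distance to a fixed teacher value, and the names `train`, `teacher`, and `rl_weight` are hypothetical.

```python
def train(teacher: float, rl_weight: float, use_distill: bool,
          steps: int = 200, lr: float = 0.1) -> float:
    """Gradient descent on a toy joint objective.

    Assumed (hypothetical) objective:
        L(theta) = (theta - teacher)**2 - rl_weight * theta
    The first term stands in for the distillation loss, the second for an
    unbounded reward that the RL objective tries to maximize.
    """
    theta = 0.0
    for _ in range(steps):
        # Gradient of the RL term -rl_weight * theta.
        grad = -rl_weight
        if use_distill:
            # Gradient of the distillation term (theta - teacher)**2.
            grad += 2.0 * (theta - teacher)
        theta -= lr * grad
    return theta


joint = train(teacher=1.0, rl_weight=0.5, use_distill=True)
rl_only = train(teacher=1.0, rl_weight=0.5, use_distill=False)
print(f"joint training settles near {joint:.3f}")
print(f"RL-only training drifts to {rl_only:.3f} and keeps growing with more steps")
```

In this toy setting the joint objective has a finite minimizer at `teacher + rl_weight / 2`, whereas the RL-only run moves further from the teacher with every step. Real diffusion training is far higher-dimensional, so this should be read as intuition only.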

Paper Structure

This paper contains 43 sections, 12 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Overview of our proposed Flash-DMD. We decouple the distillation objective by timestep into a distribution matching (DMD) loss and an adversarial loss. At high-noise timesteps, the DMD loss enables rapid alignment with the teacher model, while at low-noise timesteps and on real images, the Pixel-GAN loss is employed to enhance realism and texture details. This design makes distillation more efficient. Building on this, we further introduce a reinforcement learning strategy tailored to few-step distilled models, which integrates seamlessly with the distillation objective to achieve superior and more stable performance.
  • Figure 2: Sampling variance analysis at different time steps. The first row displays samples obtained at the 999th denoising step, while the second row corresponds to the 499th step.
  • Figure 3: Qualitative comparisons with other reinforcement approaches on SDXL.
  • Figure 4: Evaluation results of DMD2 (red) and Flash-DMD (blue) on SDXL with a TTUR ratio of 2.
  • Figure 5: Evaluation results of Flash-DMD (ours) with and without EMA on ImageReward, PickScore, and HPSv2. The training steps range from 1,000 to 8,000. Both models are trained with a two time-scale update rule (TTUR): the generator and the score estimator are updated at a ratio of 1:2, i.e., TTUR=2.
  • ...and 8 more figures
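
The training mechanics mentioned in the captions for Figures 1, 4, and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the timestep boundary of 500, and the EMA decay are all hypothetical, and the captions do not say which network's weights the EMA is applied to.

```python
from typing import List


def select_loss(timestep: int, is_real_image: bool, boundary: int = 500) -> str:
    """Pick a loss by noise level, following the Figure 1 caption.

    `boundary` is a hypothetical threshold; the captions do not state where
    the split between high-noise and low-noise timesteps is placed.
    """
    if is_real_image or timestep < boundary:
        return "pixel_gan"  # low-noise timesteps and real images
    return "dmd"            # high-noise timesteps


def ttur_schedule(num_iterations: int, ttur: int = 2) -> List[str]:
    """List update events so the score estimator gets `ttur` updates
    per generator update (1:2 when ttur=2)."""
    events = []
    for iteration in range(1, num_iterations + 1):
        events.append("score_estimator")
        if iteration % ttur == 0:
            events.append("generator")
    return events


def ema_update(ema_value: float, new_value: float, decay: float = 0.999) -> float:
    """Standard exponential moving average of a single weight."""
    return decay * ema_value + (1.0 - decay) * new_value


print(select_loss(999, is_real_image=False))
print(select_loss(100, is_real_image=False))
print(ttur_schedule(4))
```

With `ttur=2`, four iterations yield four score-estimator updates and two generator updates, which matches the 1:2 ratio stated in the Figure 5 caption.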