Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zemin Huang, Weijian Luo, Guo-Jun Qi

TL;DR

Diffusion models rely on a fixed noise schedule $t_n$ shared across all prompts, which can limit per-prompt image quality and efficiency. The authors propose the Time Prediction Diffusion Model (TPDM), which attaches a Time Prediction Module (TPM) to a frozen diffusion backbone to predict the next diffusion time at each step. The TPM models a Beta-distributed decay rate $r_n \sim \text{Beta}(\alpha_n,\beta_n)$ and sets $t_n = r_n t_{n-1}$, where $\alpha_n = 1+e^{a_n}$ and $\beta_n = 1+e^{b_n}$. TPDM is trained with Proximal Policy Optimization (PPO) to maximize a final-image reward that is aligned with human preferences and discounted by the number of steps, encouraging high quality with fewer denoising steps. Experiments on Stable Diffusion 3 Medium, Stable Diffusion 3 Large, and Flux show that TPDM can reduce inference steps by about 50% while achieving competitive or superior metrics (e.g., FID, CLIP-T, Aesthetic v2, Pick Score) and higher human preference scores (HPS).
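The schedule update in the TL;DR can be illustrated with a minimal sketch. It draws a decay rate $r_n \sim \text{Beta}(\alpha_n,\beta_n)$ with $\alpha_n = 1+e^{a_n}$ and $\beta_n = 1+e^{b_n}$, then sets $t_n = r_n t_{n-1}$. The function names, the `predict_ab` stand-in for the TPM, and the `t_min` / `max_steps` stopping rule are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def beta_params(a, b):
    """Map unconstrained outputs (a_n, b_n) to Beta parameters.

    Both parameters are strictly greater than 1, so the Beta density is unimodal.
    """
    return 1.0 + math.exp(a), 1.0 + math.exp(b)


def sample_schedule(predict_ab, t0=1.0, t_min=1e-3, max_steps=50, seed=0):
    """Sample diffusion times t_0 > t_1 > ... using t_n = r_n * t_{n-1}.

    predict_ab: hypothetical stand-in for the TPM; maps the current time to (a_n, b_n).
    t_min, max_steps: assumed stopping rule, not specified in the source.
    """
    rng = random.Random(seed)
    times = [t0]
    while times[-1] > t_min and len(times) - 1 < max_steps:
        a, b = predict_ab(times[-1])
        alpha, beta = beta_params(a, b)
        r = rng.betavariate(alpha, beta)  # decay rate in (0, 1)
        times.append(r * times[-1])
    return times


# Constant (a_n, b_n) = (0, 0) gives Beta(2, 2) at every step.
schedule = sample_schedule(lambda t: (0.0, 0.0))
print(len(schedule) - 1, "steps:", [round(t, 4) for t in schedule])
```

Because each $r_n$ lies in $(0, 1)$, the sampled times decrease monotonically, and the number of steps depends on the predicted decay rates instead of being fixed in advance.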

Abstract

Diffusion and flow matching models have achieved remarkable success in text-to-image generation. However, these models typically rely on predetermined denoising schedules for all prompts. The multi-step reverse diffusion process can be regarded as a kind of chain-of-thought for generating high-quality images step by step. Therefore, diffusion models should reason for each instance to adaptively determine the optimal noise schedule, achieving high generation quality with sampling efficiency. In this paper, we introduce the Time Prediction Diffusion Model (TPDM) for this purpose. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning to maximize a reward that encourages high final image quality while penalizing excessive denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are closely aligned with human preferences but also adjusts diffusion time and the number of denoising steps on the fly, enhancing both performance and efficiency. With the Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using around 50% fewer denoising steps to achieve better performance.
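To make the step penalty concrete, here is a small sketch of one way a final-image reward could be discounted back to earlier steps. It assumes the credit at step $n$ of an $N$-step trajectory is $\gamma^{N-n} R$, where $R$ is the reward of the final image. This exact form, the function name, and the value of `gamma` are illustrative assumptions and may differ from the paper's formulation.

```python
def discounted_returns(final_reward, num_steps, gamma=0.97):
    """Credit assigned to steps n = 1..num_steps for a single final reward.

    Assumed form (illustrative only): step n receives gamma ** (num_steps - n) * final_reward.
    """
    return [final_reward * gamma ** (num_steps - n) for n in range(1, num_steps + 1)]


# For the same positive final reward, a longer trajectory gives less credit to its first step.
short = discounted_returns(1.0, num_steps=10)
long = discounted_returns(1.0, num_steps=20)
print(short[0] > long[0])
```

Under this assumed form, two trajectories that reach the same positive reward are not equally good: the one that uses fewer denoising steps yields a larger return for its early decisions, which is the pressure toward shorter schedules described in the abstract.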

Paper Structure

This paper contains 31 sections, 7 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Samples generated by TPDM-FLUX.1-dev showcase stunning visual effects while adaptively adjusting inference steps based on the target output. The number in the lower right corner of each image indicates the inference steps used.
  • Figure 2: The inference process of TPDM. The horizontal axis represents diffusion time, ranging from 1 to 0. The image starts from random noise $x_{t_0}$ and is progressively denoised until it becomes a clean image $x_{t_N}$. Meanwhile, the reward is calculated for the final image and discounted by $\gamma$ to influence previous steps.
  • Figure 3: The architecture of TPDM consists of a frozen diffusion model and a plug-and-play Time Prediction Module.
  • Figure 4: From left to right, the images generated by TPDM-FLUX.1-dev progress from simple to complex. Our Time Prediction Module adaptively adjusts the generation schedule to suit the complexity of each generation target.
  • Figure 5: Our TPDM-SD3-Medium, when compared to the SD3-Medium with the recommended and equivalent number of steps, demonstrates superior detail processing ability and generation accuracy.
  • ...and 6 more figures