AsyncDiff: Asynchronous Timestep Conditioning for Enhanced Text-to-Image Diffusion Inference

Longhuan Xu, Feng Yin, Cunjian Chen

TL;DR

The paper addresses inefficiencies in diffusion-based text-to-image generation by decoupling the denoiser conditioning timestep from the image update schedule. It introduces a lightweight timestep prediction module trained with GRPO to adaptively select conditioning timesteps, enabling controllable de-synchronization via a single deployment scaling parameter. Empirical results show consistent improvements in multiple metrics (ImageReward, HPSv2, CLIP, PickScore), with larger gains for stronger backbones and fewer steps, particularly on Flux. The work highlights trade-offs among metrics and calls for more robust evaluation of high-frequency artifacts, proposing future work to broaden schedulers and datasets while refining loss signals.

Abstract

Text-to-image diffusion inference typically follows synchronized schedules, where the numerical integrator advances the latent state to the same timestep at which the denoiser is conditioned. We propose an asynchronous inference mechanism that decouples the two, allowing the denoiser to be conditioned at a different, learned timestep while keeping the image update schedule unchanged. A lightweight timestep prediction module (TPM), trained with Group Relative Policy Optimization (GRPO), selects a more feasible conditioning timestep based on the current state, effectively choosing a desired noise level to control image detail and textural richness. At deployment, a scaling hyper-parameter can be used to interpolate between the original and de-synchronized timesteps, enabling conservative or aggressive adjustments. To keep the study computationally affordable, we cap inference at 15 steps for SD3.5 and 10 steps for Flux. Evaluated on Stable Diffusion 3.5 Medium and Flux.1-dev across the MS-COCO 2014 and T2I-CompBench datasets, our method optimizes a composite reward that averages ImageReward, HPSv2, CLIP Score, and PickScore, and shows consistent improvement.
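
The decoupling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the additive form of the shift, the clipping to [0, 1], and the plain Euler update are all assumptions made for illustration. The only details taken from the paper summary are that the step size follows the unchanged image update schedule, that the denoiser is conditioned at a separately chosen timestep, and that a scaling factor (written `gamma` here) scales the per-step deviation `(r - 0.5)`.

```python
# Illustrative sketch only; names and the exact update rule are hypothetical.

def conditioning_timestep(t_sync, r, gamma=1.0):
    """Shift the synchronized timestep by a scaled deviation.

    t_sync: timestep given by the unchanged sampler schedule, assumed in [0, 1].
    r:      output of a timestep prediction module, assumed in [0, 1],
            so (r - 0.5) is a signed deviation.
    gamma:  deployment scaling factor; gamma = 0 recovers synchronous inference.
    """
    t_cond = t_sync + gamma * (r - 0.5)   # assumed additive form
    return min(1.0, max(0.0, t_cond))     # assumed clipping to the valid range


def async_euler_step(x, t_cur, t_next, t_cond, velocity_fn):
    """One Euler update whose step size comes from the schedule (t_cur -> t_next)
    while the denoiser (velocity_fn) is conditioned at t_cond instead of t_cur."""
    v = velocity_fn(x, t_cond)
    dt = t_next - t_cur
    return [xi + dt * vi for xi, vi in zip(x, v)]


if __name__ == "__main__":
    # Toy stand-in for a denoiser: velocity proportional to the timestep.
    toy_velocity = lambda x, t: [t * xi for xi in x]
    x = [1.0, 2.0]
    for gamma in (0.0, 0.5, 1.0, 2.0):
        t_cond = conditioning_timestep(t_sync=0.6, r=0.7, gamma=gamma)
        print(gamma, t_cond, async_euler_step(x, 0.6, 0.5, t_cond, toy_velocity))
```

Setting `gamma` between 0 and 1 gives a conservative adjustment, and values above 1 give a more aggressive one, matching the interpolation role the abstract assigns to the scaling hyper-parameter.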

Paper Structure

This paper contains 26 sections, 19 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Asynchronous Inference.
  • Figure 2: Deviation scaling result. Each chart plots evaluation metrics as we scale the per-step deviation $(r_k^\ast-0.5)$ from (\ref{eq:deviation}) using the factor $\gamma$ in (\ref{eq:scaling}) by 50%, 100% and 200%. Synchronous inference baseline and comparative experiments with lifted deviation bound (\ref{eq:deviation prime}) denoted as $2\times(r_k^\ast-0.5)$ are also covered. Curves include ImageReward, HPSv2, CLIP, and PickScore (all higher is better) on MS-COCO 2014 and T2I-CompBench from SD and Flux models.
  • Figure 3: Per-step deviation: $(r-0.5)$ distributions across datasets/models.
  • Figure 4: Per-step deviation of comparative experiments: $2\times (r-0.5)$.
  • Figure 5: Qualitative results. In each panel, the images from left to right correspond to: synchronous inference (baseline); 50%, 100% and 200% scaled asynchronous inference with deviation $(r_k^\ast-0.5)$; comparative asynchronous inference with deviation $2\times(r_k^\ast-0.5)$.
  • ...and 1 more figure