
Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

Zijing Hu, Yunze Tong, Fengda Zhang, Junkun Yuan, Jun Xiao, Kun Kuang

TL;DR

This work identifies text-to-image misalignment as a fundamental limitation of synchronous diffusion denoising. It introduces AsynDM, which assigns per-pixel timesteps and uses a concave, mask-guided schedule to denoise prompt-related regions more slowly, so that these regions benefit from clearer inter-pixel context. By extracting prompt-related masks from cross-attention maps and applying region-specific timestepping, AsynDM achieves substantial improvements in alignment metrics across diverse prompts while maintaining competitive sampling efficiency. The approach is plug-and-play and is supported by thorough ablations, qualitative and quantitative evidence, and a discussion of extensions to distortion reduction and editing tasks.

Abstract

Diffusion models have achieved impressive results in generating high-quality images. Yet, they often struggle to faithfully align the generated images with the input prompts. This limitation arises from synchronous denoising, where all pixels simultaneously evolve from random noise to clear images. As a result, during generation, the prompt-related regions can only reference the unrelated regions at the same noise level, failing to obtain clear context and ultimately impairing text-to-image alignment. To address this issue, we propose asynchronous diffusion models -- a novel framework that allocates distinct timesteps to different pixels and reformulates the pixel-wise denoising process. By dynamically modulating the timestep schedules of individual pixels, prompt-related regions are denoised more gradually than unrelated regions, thereby allowing them to leverage clearer inter-pixel context. Consequently, these prompt-related regions achieve better alignment in the final images. Extensive experiments demonstrate that our asynchronous diffusion models can significantly improve text-to-image alignment across diverse prompts. The code repository for this work is available at https://github.com/hu-zijing/AsynDM.
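
The idea of giving different pixels different timesteps can be illustrated with a small sketch. This is not the authors' implementation: the quadratic concave schedule, the linear schedule for unrelated pixels, the hand-written fixed mask, and the function names `linear_schedule`, `concave_schedule`, and `pixel_timesteps` are all assumptions chosen for illustration. In the paper the mask is extracted from cross-attention and the schedules are modulated dynamically.

```python
import numpy as np

def linear_schedule(i, T):
    # Assumed schedule for prompt-unrelated pixels: timestep drops linearly.
    return T - i

def concave_schedule(i, T):
    # One possible concave f with f(0) = T and f(T) = 0 (an assumption,
    # not necessarily the function used in the paper).
    return T * (1.0 - (i / T) ** 2)

def pixel_timesteps(mask, i, T):
    # Per-pixel timestep map at sampling step i: masked (prompt-related)
    # pixels follow the concave schedule, the rest follow the linear one.
    return np.where(mask, concave_schedule(i, T), linear_schedule(i, T))

T = 1000
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # hypothetical prompt-related region

for i in (0, 250, 500, 750, 1000):
    t_map = pixel_timesteps(mask, i, T)
    print(i, t_map[0, 0], t_map[1, 1])
```

Because a concave curve lies on or above the straight line joining its endpoints, the masked pixels stay at a higher (noisier) timestep than the rest throughout sampling. Under these assumptions, at step 500 of 1000 an unrelated pixel is at timestep 500 while a masked pixel is still at 750, so the masked region sees cleaner surroundings while it is being denoised.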

Paper Structure

This paper contains 29 sections, 1 theorem, 12 equations, 12 figures, 5 tables.

Key Result

Proposition 1

(See the proof in the appendix.) Let $f(i):[0,T]\to\mathbb{R}$ be a concave function with $f(0)=T$ and $f(T)=0$. For any $i_0$ with $0<i_0<T$ and any $t_0$ with $T-i_0 \le t_0 \le f(i_0)$, there exist unique constants $a,b$ such that the shifted function $f(i-a)+b$ satisfies:

Figures (12)

  • Figure 1: Existing diffusion models generate images through synchronous denoising, where all pixels are denoised simultaneously, step by step, from noise to images, which hinders text-to-image alignment. Asynchronous diffusion models denoise the prompt-related regions more gradually than other regions, so these regions receive clearer inter-pixel context and ultimately achieve improved alignment.
  • Figure 2: Asynchronous diffusion models improve text-to-image alignment by (a) assigning distinct timesteps to different pixels, where faster-denoised regions provide clearer context, serving as better references for slower ones, and (b) using masks extracted from cross-attention to identify prompt-related regions and dynamically modulate pixel-level timestep schedules.
  • Figure 3: Any point located within the shaded area can reach $t=0$ along an appropriately shifted $f$.
  • Figure 4: The samples generated by AsynDM and baseline methods across diverse prompts. The images generated by AsynDM show better text-to-image alignment.
  • Figure 5: Human preference rates for text-to-image alignment of the images generated by DM, $\text{DM}_\text{concave}$ and AsynDM.
  • ...and 7 more figures
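
The claim in the Figure 3 caption can be checked numerically for one concrete schedule. The sketch below rests on several assumptions that are not confirmed by the text above: the shaded area is taken to be the region $T-i_0 \le t_0 \le f(i_0)$ from Proposition 1, "reach $t=0$" is read as the shifted curve $f(i-a)+b$ passing through $(i_0, t_0)$ and hitting zero at $i=T$, and $f$ is a hypothetical quadratic. The function name `solve_shift` and the bisection approach are illustrative only.

```python
def f(i, T):
    # Hypothetical concave schedule with f(0) = T and f(T) = 0.
    return T * (1.0 - (i / T) ** 2)

def solve_shift(i0, t0, T, iters=80):
    # Find (a, b) with f(i0 - a) + b = t0 and f(T - a) + b = 0.
    # Subtracting the two conditions leaves g(a) = f(i0 - a) - f(T - a) = t0.
    # For concave f, g is non-increasing in a, with g(0) >= t0 >= g(i0)
    # whenever T - i0 <= t0 <= f(i0), so bisection on [0, i0] applies.
    lo, hi = 0.0, float(i0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(i0 - mid, T) - f(T - mid, T) > t0:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    b = -f(T - a, T)  # forces the shifted curve to reach 0 at i = T
    return a, b

T, i0, t0 = 1000, 400, 700.0   # 600 <= 700 <= f(400) = 840
a, b = solve_shift(i0, t0, T)
print(a, b)
print(f(i0 - a, T) + b, f(T - a, T) + b)
```

Under this reading, a pixel that is at timestep 700 after 400 steps can follow the shifted curve for the remaining steps and finish at timestep 0 together with every other pixel.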

Theorems & Definitions (1)

  • Proposition 1