Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection

Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, Zeke Xie

TL;DR

This work tackles the trade-off between image fidelity and prompt adherence in diffusion models by introducing diffusion self-reflection and Zigzag Diffusion Sampling (Z-Sampling). It identifies semantic information latent in the guidance gap between denoising and inversion, and exploits it through a stepwise zigzag denoise-invert process that accumulates semantic cues along the sampling path. The authors provide theoretical analysis showing that the cumulative semantic information gain of Z-Sampling surpasses that of end-to-end injection, and they report empirical gains across multiple datasets, models, and metrics, including when Z-Sampling is combined with orthogonal methods. The results indicate that Z-Sampling improves generation quality and prompt alignment for challenging prompts while remaining efficient, which suggests practical value for high-fidelity, semantically accurate image synthesis. The approach also opens avenues for applying self-reflection to other diffusion-based tasks (e.g., video, 3D) and motivates further theoretical study of how semantic information flows in latent spaces.

Abstract

Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain both high image quality and high prompt-image alignment for challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we make three main contributions in this paper. First, we propose diffusion self-reflection, which alternately performs denoising and inversion, and we demonstrate with theoretical and empirical evidence that it can leverage the guidance gap between denoising and inversion to capture prompt-related semantic information. Second, motivated by theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel self-reflection-based diffusion sampling method that leverages the guidance gap between denoising and inversion to accumulate semantic information step by step along the sampling path, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding and computational costs. Third, our extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. For example, DreamShaper with Z-Sampling can self-improve with an HPSv2 winning rate of up to 94% over the original results. Moreover, Z-Sampling can further enhance existing diffusion models when combined with other orthogonal methods, including Diffusion-DPO.
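The alternating denoise-invert idea described in the abstract can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: `toy_eps` is a hypothetical stand-in for a trained noise predictor, the guidance blend uses one common classifier-free guidance convention, the update is a deterministic DDIM-style step over a cumulative alpha schedule, and the "denoise, invert with weaker guidance, denoise again" ordering inside each zigzag step is one plausible reading of the method. The parameter names (`gamma_denoise`, `gamma_invert`, `zigzag_steps`) are likewise hypothetical.

```python
import numpy as np

def toy_eps(x, t, cond):
    """Hypothetical noise predictor; cond=None means unconditional."""
    target = np.zeros_like(x) if cond is None else cond
    return x - target

def cfg_eps(x, t, cond, gamma):
    """Classifier-free guidance blend (assumed convention: e_u + gamma * (e_c - e_u))."""
    e_u = toy_eps(x, t, None)
    e_c = toy_eps(x, t, cond)
    return e_u + gamma * (e_c - e_u)

def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM-style move between two cumulative-alpha noise levels."""
    x0 = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0 + np.sqrt(1.0 - a_to) * eps

def sample(x_T, cond, alphas, gamma_denoise=5.5, gamma_invert=0.0, zigzag_steps=0):
    """alphas[0] is the cleanest level, alphas[-1] the noisiest.

    For the first `zigzag_steps` steps, each denoising step is followed by an
    inversion with a different guidance scale and then a second denoising step.
    """
    x = x_T
    T = len(alphas) - 1
    for i, t in enumerate(range(T, 0, -1)):
        a_t, a_prev = alphas[t], alphas[t - 1]
        # Denoise with strong guidance.
        x_prev = ddim_move(x, cfg_eps(x, t, cond, gamma_denoise), a_t, a_prev)
        if i < zigzag_steps:
            # Invert back to level t with weaker guidance (the guidance gap).
            x_back = ddim_move(x_prev, cfg_eps(x_prev, t, cond, gamma_invert), a_prev, a_t)
            # Denoise again from the re-noised latent.
            x_prev = ddim_move(x_back, cfg_eps(x_back, t, cond, gamma_denoise), a_t, a_prev)
        x = x_prev
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(4)
cond = np.ones(4)                      # toy "prompt" direction
alphas = np.linspace(0.999, 0.01, 11)  # toy cumulative schedule, 10 steps
plain = sample(x_T, cond, alphas, zigzag_steps=0)
zigzag = sample(x_T, cond, alphas, zigzag_steps=3)
print(plain, zigzag)
```

In this sketch the inversion step is not an exact inverse of the denoising step because it uses a different guidance scale, so the re-noised latent differs from the original one; that difference is where the toy version of the guidance gap enters the sampling path.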

Paper Structure

This paper contains 63 sections, 3 theorems, 25 equations, 27 figures, 18 tables, 2 algorithms.

Key Result

Theorem 1

For a random latent $x_{T} \in \mathcal{N}$ and an inverted latent $\tilde{x}_{T}$ given by the inversion equation, the latent difference $\delta_{end2end}$ between $x_{T}$ and $\tilde{x}_{T}$ is expressed in terms of $h_{t}$ and $\epsilon_{\theta}^{t}(\cdot)$, where $h_{t} = \sqrt{1/\alpha_{t}-1}-\sqrt{1/\alpha_{t-1}-1}$ and $\epsilon_{\theta}^{t}(\cdot)$ is the predicted score given by the classifier-free guidance equation.
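The step coefficient $h_{t}$ in the theorem statement can be computed directly from a noise schedule. The sketch below assumes that $\alpha_{t}$ denotes the cumulative schedule value at step $t$ (with index 0 the cleanest level); the function name `step_coefficients` and the linear toy schedule are hypothetical and chosen only for illustration.

```python
import numpy as np

def step_coefficients(alphas):
    """Return h_1..h_T with h_t = sqrt(1/alpha_t - 1) - sqrt(1/alpha_{t-1} - 1)."""
    alphas = np.asarray(alphas, dtype=float)
    sigma = np.sqrt(1.0 / alphas - 1.0)  # noise-to-signal ratio at each level
    return sigma[1:] - sigma[:-1]

alphas = np.linspace(0.999, 0.01, 11)  # toy cumulative schedule, decreasing in t
h = step_coefficients(alphas)
print(h)
```

Because the terms telescope, the coefficients sum to $\sqrt{1/\alpha_{T}-1}-\sqrt{1/\alpha_{0}-1}$, and each $h_{t}$ is positive whenever $\alpha_{t}$ decreases with $t$.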

Figures (27)

  • Figure 1: The qualitative results of Z-Sampling demonstrate the effectiveness of our method in various aspects, such as style, position, color, counting, text rendering, and object co-occurrence. We present more cases in the appendix.
  • Figure 2: Semantic-rich latents effectively generate images aligned with intended semantics. For instance, the random latent (seed 21) is better suited for generating images related to the concept of "flowers". We present more cases in the appendix.
  • Figure 3: If the latent carries semantic information, we can obtain prompt-related results from this latent even without conditional guidance.
  • Figure 4: Z-Sampling
  • Figure 5: The cross-attention map highlights the interaction between the entity token (red color) and latent variables. Z-Sampling optimizes the latent so that it is better suited to generating the concepts in the related prompt. For example, in the zigzag path of the second column, semantically injected latents exhibit sharper attention on "dog" with relatively clear boundaries.
  • ...and 22 more figures

Theorems & Definitions (6)

  • Theorem 1: See the proof in the appendix (Proof F.1)
  • Theorem 2: See the proof in the appendix (Proof F.2)
  • Theorem 3: See the proof in the appendix (Proof F.3)
  • Proof F.1: Theorem 1
  • Proof F.2: Theorem 2
  • Proof F.3: Theorem 3