Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis

Xi Wang, Ziqi He, Yang Zhou

TL;DR

The paper addresses an overlooked property of diffusion U-Nets: the importance of their attention blocks changes over the course of inference. It introduces Importance Probe (IP) to quantify time-varying block importance, together with an adaptive, training-free re-weighting schedule that scales Transformer block outputs at each step, guided by a voting-based ranking. The authors report that re-weighting improves sampling efficiency and sample aesthetics while preserving identity in generated images, supported by dynamic attention pruning and experiments across SD/SDXL variants. The method is model-agnostic and training-free, and targets diffusion-based image generation and editing.

Abstract

Traditional diffusion models typically employ a U-Net architecture. Previous studies have revealed the roles of attention blocks in the U-Net, but they overlook how the importance of these blocks evolves during inference, which limits how far the blocks can be exploited to improve image applications. In this study, we first prove theoretically that re-weighting the outputs of the Transformer blocks within the U-Net is a "free lunch" for improving the signal-to-noise ratio during sampling. Next, we propose Importance Probe to uncover and quantify the dynamic shifts in the importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that our approach significantly improves the efficiency of inference and enhances the aesthetic quality of the samples while maintaining identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: https://github.com/Hytidel/UNetReweighting
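To make the idea of an importance-based schedule concrete, the following is a minimal sketch of how per-step rankings could be turned into per-block weights. It is an illustration under stated assumptions, not the authors' implementation: the Borda-count aggregation, the function names `vote_ranking` and `weights_from_ranking`, the block names, the `top_k` cutoff, the default `low = 0.9`, and the probe data are all hypothetical. Only the notions of a voting-based ranking and of `high`/`low` weights come from the text.

```python
from collections import defaultdict


def vote_ranking(rankings):
    """Aggregate several rankings (most important first) with a Borda count.

    Assumption: Borda count stands in for the paper's voting rule,
    which this summary does not specify.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, block in enumerate(ranking):
            scores[block] += n - position  # first place earns n points, last earns 1
    # Highest score first; ties are broken by name so the result is deterministic.
    return sorted(scores, key=lambda block: (-scores[block], block))


def weights_from_ranking(ranking, top_k, high=1.1, low=0.9):
    """Assign `high` to the top_k ranked blocks and `low` to the rest.

    The default values and the top_k cutoff are illustrative assumptions.
    """
    return {block: (high if i < top_k else low) for i, block in enumerate(ranking)}


# Hypothetical probe output: for each denoising step, three runs rank three blocks.
probe_rankings = {
    0: [["mid", "down", "up"], ["mid", "up", "down"], ["down", "mid", "up"]],
    1: [["up", "mid", "down"], ["up", "down", "mid"], ["mid", "up", "down"]],
}

# One weight table per denoising step.
schedule = {
    step: weights_from_ranking(vote_ranking(runs), top_k=1)
    for step, runs in probe_rankings.items()
}
```

In this toy data the most important block shifts from `mid` at step 0 to `up` at step 1, so the weight `high` moves with it. That per-step shift is the kind of dynamic behaviour the abstract describes.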

Paper Structure

This paper contains 22 sections, 1 theorem, 8 equations, 4 figures, and 6 tables.

Key Result

Proposition 1

(Proof in Appendix) The variance of the error where $A_i$ denotes the mapping transformation from the output of the $i$-th Transformer block to the final noise prediction.

Figures (4)

  • Figure 1: Our approach enhances the U-Net capability in the following tasks without additional training or fine-tuning: (a) improving sampling efficiency; (b) & (c) enhancing the visual aesthetics of samples with identity consistency; and (d) achieving better fidelity in pruned sampling. Images are evaluated at 512$\times$512 px with the SD model and 1024$\times$1024 px with the SDXL model.
  • Figure 2: Illustration of how the outputs of Transformer blocks are scaled before being passed to subsequent ResNet blocks.
  • Figure 3: Scatter plot of FID and LPIPS under different skipping strategies (the further lower-left, the better). Baseline strategies are represented by blue circles, unique points from our strategies are shown as pink triangles, while points overlapping with baseline points are marked with purple stars.
  • Figure 4: Line chart showing the effect of re-weighting on SD-Turbo and SDXL-Turbo with fixed $high = 1.1$ as $low$ varies. Lines of the same color represent the same category, where dashed lines indicate the vanilla schedule, and solid lines represent our re-weighting schedule.
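The scaling step described for Figure 2 can be sketched with toy stand-ins for the real blocks. This is a simplified illustration, not the released code: the function `reweighted_forward`, the stage names, and the toy block functions are hypothetical, and the weights 0.5 and 1.5 are chosen only so the arithmetic is exact (the figures above use values such as $high = 1.1$).

```python
def reweighted_forward(x, stages, weights):
    """Run (name, transformer_fn, resnet_fn) stages on a list of floats.

    Each Transformer output is multiplied by weights[name] before the
    following ResNet function receives it. Names missing from `weights`
    default to 1.0, which reproduces the unmodified (vanilla) pass.
    """
    for name, transformer_fn, resnet_fn in stages:
        hidden = transformer_fn(x)
        scale = weights.get(name, 1.0)
        hidden = [scale * value for value in hidden]  # the re-weighting step
        x = resnet_fn(hidden)
    return x


# Toy stand-ins for real blocks (assumptions for illustration only).
def toy_transformer(values):
    return [2.0 * v for v in values]


def toy_resnet(values):
    return [v + 1.0 for v in values]


stages = [
    ("down", toy_transformer, toy_resnet),
    ("up", toy_transformer, toy_resnet),
]

vanilla = reweighted_forward([1.0, 2.0], stages, {})
scaled = reweighted_forward([1.0, 2.0], stages, {"down": 0.5, "up": 1.5})
print(vanilla)  # [7.0, 11.0]
print(scaled)   # [7.0, 10.0]
```

In a real U-Net the same effect would typically be obtained by wrapping or hooking the Transformer modules rather than rewriting the forward pass, and the weight table would change at every denoising step.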

Theorems & Definitions (1)

  • Proposition 1