Dynamic Attention-Guided Diffusion for Image Super-Resolution

Brian B. Moser; Stanislav Frolov; Federico Raue; Sebastian Palacio; Andreas Dengel

TL;DR

This work introduces You Only Diffuse Areas (YODA), a dynamic attention-guided diffusion framework for image super-resolution that concentrates iterative refinement on salient regions identified from low-resolution inputs via self-supervised attention maps (DINO). By employing time-dependent masks that expand refinement over diffusion steps and blending refined SR regions with untouched LR regions, YODA improves SR quality and stabilizes training, especially under small-batch regimes. Empirical results show consistent gains when integrating YODA with SR3, DiffBIR, and SRDiff for face and general SR tasks, including reduced color shifts and better perceptual metrics like LPIPS. The approach is plug-and-play with existing diffusion SR models and highlights the importance of content-aware diffusion in reducing artifacts while enhancing detail-rich regions, with potential implications for broader SR and content generation tasks.

Abstract

Diffusion models for image Super-Resolution (SR) treat all image regions uniformly, which risks degrading overall image quality by introducing artifacts when denoising less complex regions. To address this, we propose "You Only Diffuse Areas" (YODA), a dynamic attention-guided diffusion process for image SR. YODA selectively focuses on spatial regions defined by attention maps derived from the low-resolution images and the current denoising time step. This time-dependent targeting enables a more efficient conversion to high-resolution outputs by focusing on the areas that benefit most from iterative refinement, i.e., detail-rich objects. We empirically validate YODA by extending the leading diffusion-based methods SR3, DiffBIR, and SRDiff. Our experiments demonstrate new state-of-the-art performance in face and general SR tasks across PSNR, SSIM, and LPIPS metrics. As a side effect, we find that YODA reduces color shift issues and stabilizes training with small batches.
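The time-dependent targeting described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the thresholding rule (a pixel is selected once `num_steps * attention >= t`), the function names `time_dependent_mask` and `blend`, and the toy 2x2 grids are all assumptions made for illustration. Plain Python lists stand in for image tensors.

```python
from typing import List

Grid = List[List[float]]


def time_dependent_mask(attention: Grid, t: int, num_steps: int) -> List[List[int]]:
    """Binary mask for step t (assumed rule: select once num_steps * a >= t).

    Attention values are expected in [0, 1]; high-attention pixels are
    selected early (large t), low-attention pixels only near t = 0.
    """
    return [[1 if num_steps * a >= t else 0 for a in row] for row in attention]


def blend(mask: List[List[int]], refined: Grid, lr_noised: Grid) -> Grid:
    """Keep refined values where mask == 1 and LR-derived values elsewhere."""
    return [
        [m * r + (1 - m) * z for m, r, z in zip(m_row, r_row, z_row)]
        for m_row, r_row, z_row in zip(mask, refined, lr_noised)
    ]


# Hypothetical 2x2 attention map: top-left is the most detail-rich pixel.
attention = [[0.9, 0.5], [0.2, 0.05]]
T = 500
for t in (500, 400, 200, 50, 0):
    print(t, time_dependent_mask(attention, t, T))
```

Under this assumed rule the selected area grows monotonically as `t` decreases, so the whole image is refined by the final step, while `blend` produces a full image with no masked-out regions for the next iteration.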

Paper Structure (16 sections, 12 equations, 15 figures, 5 tables)


Introduction
Background
DDPMs
DINO
Methodology
Identifying Key Regions
Time-Dependent Masking
Guided Backward Diffusion
Optimization
Experiments
Choosing Good Attention Maps
Face Super-Resolution
General Super-Resolution
Conclusion
Limitations & Future Work
...and 1 more sections

Figures (15)

Figure 1: Overview of YODA. First, extract an attention map $\mathbf{A}$ from the LR input. Next, use the values of $\mathbf{A}$ to produce a time-dependent mask $\mathbf{M}(t)$. As $t$ goes from $T$ to $0$, the area of selected pixels expands from detail-rich regions to the whole image. Our diffusion process uses these masks for dynamic, attention-guided refinement that emphasizes regions differently. More specifically, each step combines the masked areas that need refinement (derived from $\mathbf{z}_{t}$ and $\mathbf{M}(t)$) with LR regions, which retain the noise level needed for the next time step. Finally, the SR and LR areas are combined to form a whole image with no masked-out regions for the next iteration.
Figure 2: (Left) Comparison of various methods to extract attention maps used for our method (blue = low attention; yellow = high attention). Top row denotes maps derived from ResNet-50 using DINO. It shows various attention head outputs and the max aggregation of all attention maps (MAX). Bottom row denotes non-learnable methods, namely Gaussian, Edge-based, and using SIFT's points of interest. (Right) Comparison of different attention maps with SR3+YODA for $16 \rightarrow 128$ on CelebA-HQ. Aggregating the attention maps extracted with DINO and ResNet-50 backbone under the MAX strategy performs best. The attention maps are then used for dynamic binary masking.
Figure 3: (Left) Ratio of pixels diffused under our time-dependent masking approach to the total number of pixel updates in standard diffusion. On average, DINO with a ResNet-50 backbone leads to more pixel updates than the ViT-S/8 backbone. The lower bound, defined by $l$, is a threshold to eliminate areas that would never undergo diffusion. (Right) Refined image area in percent across time steps for the MAX combination. Note that the sampling process goes from $t=T=500$ to $t=0$. ResNet-50 initiates the refinement process much earlier, advances more rapidly toward refining the entire image, and has a higher standard deviation.
Figure 4: SR3 and SR3+YODA reconstructions, $64 \rightarrow 256$ (4×). SR3 suffers from color shifting, as also observed in prior work \cite{wang2023exploiting, choi2022perception}. YODA solves this issue and produces higher-quality reconstructions.
Figure 5: Regional LPIPS comparison across normalized attention values for CelebA, $64 \rightarrow 256$ (4×). We use 0.01 intervals and fit a polynomial through the means. High-attention areas are perceptually relevant and correspond to more difficult pixels (higher LPIPS). YODA reaches better scores, especially within high-attention areas. Note that dynamic masking stops around $t \approx 0.6 \cdot T$, see \ref{fig:iterations}.
...and 10 more figures
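The captions for Figures 2 and 3 mention three quantities that a short sketch may make more concrete: the MAX aggregation over attention heads, the lower bound $l$, and the ratio of diffused pixels to the pixel updates of standard diffusion. The code below is an illustration under stated assumptions, not the paper's code: the helper names (`max_aggregate`, `rescale`, `diffused_ratio`) are hypothetical, the lower bound is assumed to act as a min-max rescaling of the attention map to $[l, 1]$, and a pixel is assumed to be updated at step $t \in \{1, \dots, T\}$ whenever $T \cdot \mathbf{A}_{i,j} \geq t$.

```python
from typing import List

Grid = List[List[float]]


def max_aggregate(heads: List[Grid]) -> Grid:
    """Pixel-wise maximum over a list of per-head attention maps (MAX strategy)."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*heads)]


def rescale(attention: Grid, lower: float) -> Grid:
    """Min-max normalize to [lower, 1] (assumed role of the lower bound l)."""
    flat = [a for row in attention for a in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant maps
    return [[lower + (1 - lower) * (a - lo) / span for a in row] for row in attention]


def diffused_ratio(attention: Grid, num_steps: int) -> float:
    """Masked pixel updates divided by updates when every pixel is diffused each step."""
    n_pixels = sum(len(row) for row in attention)
    updates = sum(
        1
        for t in range(1, num_steps + 1)
        for row in attention
        for a in row
        if num_steps * a >= t  # assumed selection rule
    )
    return updates / (num_steps * n_pixels)


# Two hypothetical attention heads on a 2x2 image.
heads = [[[0.1, 0.8], [0.3, 0.2]], [[0.6, 0.4], [0.1, 0.9]]]
combined = max_aggregate(heads)
print(combined)  # [[0.6, 0.8], [0.3, 0.9]]
print(diffused_ratio([[1.0, 0.5], [0.25, 0.75]], num_steps=4))  # 0.625
```

With a positive lower bound, every pixel of the rescaled map is selected for at least a fraction $l$ of the steps under this rule, which is one way to read the caption's remark that $l$ eliminates areas that would never undergo diffusion.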
