Dynamic Attention-Guided Diffusion for Image Super-Resolution
Brian B. Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, Andreas Dengel
TL;DR
This work introduces You Only Diffuse Areas (YODA), a dynamic attention-guided diffusion framework for image super-resolution that concentrates iterative refinement on salient regions identified from low-resolution inputs via self-supervised attention maps (DINO). By employing time-dependent masks that expand refinement over diffusion steps and blending refined SR regions with untouched LR regions, YODA improves SR quality and stabilizes training, especially under small-batch regimes. Empirical results show consistent gains when integrating YODA with SR3, DiffBIR, and SRDiff for face and general SR tasks, including reduced color shifts and better perceptual metrics like LPIPS. The approach is plug-and-play with existing diffusion SR models and highlights the importance of content-aware diffusion in reducing artifacts while enhancing detail-rich regions, with potential implications for broader SR and content generation tasks.
Abstract
Diffusion models in image Super-Resolution (SR) treat all image regions uniformly, which risks compromising the overall image quality by potentially introducing artifacts during denoising of less-complex regions. To address this, we propose "You Only Diffuse Areas" (YODA), a dynamic attention-guided diffusion process for image SR. YODA selectively focuses on spatial regions defined by attention maps derived from the low-resolution images and the current denoising time step. This time-dependent targeting enables a more efficient conversion to high-resolution outputs by focusing on areas that benefit the most from the iterative refinement process, i.e., detail-rich objects. We empirically validate YODA by extending the leading diffusion-based methods SR3, DiffBIR, and SRDiff. Our experiments demonstrate new state-of-the-art performance in face and general SR tasks across PSNR, SSIM, and LPIPS metrics. As a side effect, we find that YODA reduces color shift issues and stabilizes training with small batches.
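
The idea of a time-dependent mask that grows as denoising proceeds, followed by blending refined and untouched regions, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the text above: the thresholding rule, the `floor` parameter, the function names, and the hand-written 2x2 attention values standing in for a DINO attention map. The sketch assumes attention values in [0, 1] and a time step t that counts down from `total_steps` to 0.

```python
def time_dependent_mask(attention, t, total_steps, floor=0.0):
    """Return a binary mask with 1 where a pixel is refined at step t.

    Assumed rule (hypothetical): a pixel with attention value a becomes
    active once total_steps * (a + floor) >= t. Because t counts down,
    high-attention pixels are active early and the mask only grows.
    """
    return [
        [1 if total_steps * (a + floor) >= t else 0 for a in row]
        for row in attention
    ]


def blend(refined, untouched, mask):
    """Keep refined values where mask is 1 and untouched values elsewhere."""
    return [
        [m * r + (1 - m) * u for r, u, m in zip(r_row, u_row, m_row)]
        for r_row, u_row, m_row in zip(refined, untouched, mask)
    ]


# Hand-written stand-in for an attention map (hypothetical values).
attention = [[0.9, 0.2],
             [0.5, 0.0]]
T = 10

# Early in the reverse process (large t) only the most salient pixel is active.
early = time_dependent_mask(attention, t=8, total_steps=T)

# Late in the process (small t), with a floor, every pixel is active.
late = time_dependent_mask(attention, t=1, total_steps=T, floor=0.2)

refined = [[1.0, 1.0], [1.0, 1.0]]    # placeholder for the denoised estimate
untouched = [[0.0, 0.0], [0.0, 0.0]]  # placeholder for the unrefined input
mixed = blend(refined, untouched, early)
```

With a positive `floor`, even pixels with zero attention are refined for at least the final steps, which is one simple way to make sure no region is skipped entirely.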
