Crafting Query-Aware Selective Attention for Single Image Super-Resolution
Junyoung Kim, Youngrok Kim, Siyeol Jung, Donghyun Min
TL;DR
SSCAN tackles single image super-resolution (SR) with a query-aware selective attention mechanism that concentrates computation on the regions most relevant to reconstruction. Its core component, FGCA, selects the top-$k$ windows by query-key similarity and applies attention only to those windows; combined with fixed-size windows and FlashAttention, this yields linear-like complexity with respect to image size. Empirically, SSCAN outperforms existing selective-attention SR models by up to 0.14 dB PSNR on Urban100 at comparable parameter counts, and memory analyses indicate a substantially smaller footprint. These findings suggest a practical, scalable path to high-quality SR on large images, including resource-constrained and on-device settings.
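To make the complexity claim concrete, the following back-of-the-envelope count is a sketch under stated assumptions, not the paper's own derivation. The symbols $N$ (number of tokens in the feature map), $M$ (tokens per fixed-size window), and $d$ (channel dimension) are introduced here for illustration; only $k$ appears in the summary above. It also assumes each query window attends to exactly $k$ selected key-value windows.

$$
\underbrace{\frac{N}{M}}_{\text{query windows}} \cdot \underbrace{M \cdot kM \cdot d}_{\text{attention per window}} = k\,M\,N\,d = O(N) \quad \text{for fixed } k, M, d,
\qquad \text{versus} \qquad N \cdot N \cdot d = O(N^{2}) \ \text{for global attention.}
$$

The cost of choosing the windows is not included in this count, because the summary does not state how the query-key similarity is computed. If every pair of windows were scored through one pooled descriptor per window (an assumption), that step would add on the order of $(N/M)^{2} d$ operations.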
Abstract
Single Image Super-Resolution (SISR) reconstructs high-resolution images from low-resolution inputs, recovering fine image detail. Vision Transformer (ViT)-based models improve SISR by capturing long-range dependencies, but they either incur quadratic computational cost or rely on selective attention mechanisms that do not explicitly focus on query-relevant regions. Prior work has largely overlooked how selective attention should be designed for SISR. We propose SSCAN, which dynamically selects the most relevant key-value windows based on query similarity, ensuring focused feature extraction while maintaining efficiency. In contrast to prior approaches that apply attention globally or heuristically, our method introduces a query-aware window selection strategy that better aligns attention computation with important image regions. By using fixed-size windows, SSCAN reduces memory usage and keeps token-to-token complexity linear, making it scalable to large images. Our experiments demonstrate that SSCAN outperforms existing attention-based SISR methods, achieving up to 0.14 dB PSNR improvement on urban datasets and delivering both computational efficiency and reconstruction quality.
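The query-aware window selection described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `topk_window_attention`, the mean-pooled window descriptors used for scoring, the identity query/key/value projections, and the single attention head are all assumptions made to keep the example short.

```python
import numpy as np


def topk_window_attention(x, window=4, k=2):
    """Illustrative top-k window attention on a feature map.

    x: array of shape (H, W, C), with H and W divisible by `window`.
    Returns the attended feature map (H, W, C) and the selected
    window indices of shape (num_windows, k).
    """
    H, W, C = x.shape
    if H % window or W % window:
        raise ValueError("H and W must be divisible by the window size")
    nh, nw = H // window, W // window
    n, M = nh * nw, window * window
    if k > n:
        raise ValueError("k cannot exceed the number of windows")

    # Partition into non-overlapping fixed-size windows: (n, M, C).
    wins = x.reshape(nh, window, nw, window, C).transpose(0, 2, 1, 3, 4)
    wins = wins.reshape(n, M, C)

    # Assumption: identity projections stand in for learned Q/K/V layers.
    q, key, v = wins, wins, wins

    # Assumption: one mean-pooled descriptor per window is used for scoring.
    sim = q.mean(axis=1) @ key.mean(axis=1).T          # (n, n)

    # Indices of the k most similar key windows for each query window.
    idx = np.argsort(-sim, axis=1)[:, :k]              # (n, k)

    # Gather the selected key/value windows and flatten them per query window.
    k_sel = key[idx].reshape(n, k * M, C)
    v_sel = v[idx].reshape(n, k * M, C)

    # Scaled dot-product attention restricted to the selected windows.
    scores = q @ k_sel.transpose(0, 2, 1) / np.sqrt(C)  # (n, M, k*M)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v_sel                               # (n, M, C)

    # Undo the window partition to recover the (H, W, C) layout.
    out = out.reshape(nh, nw, window, window, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C), idx
```

Because `window` and `k` are fixed, each query token attends to `k * window * window` keys regardless of image size, which is the property the abstract relies on for linear token-to-token complexity. In this sketch the window scoring step still compares every pair of windows.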
