Progressive Focused Transformer for Single Image Super-Resolution
Wei Long, Xingyu Zhou, Leheng Zhang, Shuhang Gu
TL;DR
This work tackles the computational bottleneck of transformer-based single-image super-resolution (SR) by introducing the Progressive Focused Transformer (PFT). Its core component, Progressive Focused Attention (PFA), links attention maps across layers to emphasize consistently relevant tokens while filtering out irrelevant ones. Through a cross-layer inheritance mechanism and sparse matrix multiplication, PFA skips unnecessary similarity calculations, which makes larger interaction windows affordable and yields state-of-the-art PSNR/SSIM with lower FLOPs. Extensive ablations show substantial gains from progressive attention, from well-chosen focus ratios, and from larger window sizes, and comparisons on standard benchmarks show strong performance against recent SR methods. The approach promises practical efficiency improvements for high-resolution image restoration and could extend to other vision tasks that benefit from broad contextual interactions.
Abstract
Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.
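The idea of inheriting attention across layers and scoring only the surviving tokens can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `focused_attention`, the element-wise product used to combine the current and inherited weights, the top-k selection rule, and the `keep` parameter are all assumptions made for illustration, and the gather-based indexing stands in for the sparse matrix multiplication mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def focused_attention(q, k, v, prev_attn=None, prev_idx=None, keep=None):
    """Hypothetical single-head attention layer that scores only inherited keys.

    q, k, v   : arrays of shape (n, d)
    prev_attn : (n, m) weights inherited from the previous layer, or None
    prev_idx  : (n, m) indices of the keys kept for each query, or None
    keep      : number of keys to pass on to the next layer (assumed rule: top-k)
    """
    n, d = q.shape
    if prev_idx is None:
        # First layer: every query attends to every key (dense attention).
        prev_idx = np.tile(np.arange(n), (n, 1))
    # Gather only the kept keys and values, so similarities with
    # filtered-out tokens are never computed.
    k_sel = k[prev_idx]                      # (n, m, d)
    v_sel = v[prev_idx]                      # (n, m, d)
    scores = np.einsum("nd,nmd->nm", q, k_sel) / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    if prev_attn is not None:
        # Assumed combination rule: reweight by the inherited map, renormalize.
        attn = attn * prev_attn
        attn = attn / attn.sum(axis=-1, keepdims=True)
    out = np.einsum("nm,nmd->nd", attn, v_sel)
    # Keep the strongest keys per query for the next layer.
    m = attn.shape[1]
    keep = m if keep is None else min(keep, m)
    top = np.argsort(-attn, axis=-1)[:, :keep]
    next_idx = np.take_along_axis(prev_idx, top, axis=-1)
    next_attn = np.take_along_axis(attn, top, axis=-1)
    return out, next_attn, next_idx

# Two stacked layers on random features: 16 tokens, 8 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
out1, attn1, idx1 = focused_attention(x, x, x, keep=8)
out2, attn2, idx2 = focused_attention(x, x, x, prev_attn=attn1, prev_idx=idx1, keep=4)
print(out2.shape, idx2.shape)  # (16, 8) (16, 4)
```

In this sketch the second layer computes 8 similarities per query instead of 16, and the set of kept keys can only shrink from one layer to the next, which is the property that lets the cost per layer drop as attention becomes more focused.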
