Progressive Focused Transformer for Single Image Super-Resolution

Wei Long, Xingyu Zhou, Leheng Zhang, Shuhang Gu

TL;DR

This work tackles the computational bottleneck of transformer-based single-image super-resolution with the Progressive Focused Transformer (PFT). Its core component, Progressive Focused Attention (PFA), links attention maps across layers so that consistently relevant tokens are emphasized and irrelevant ones are filtered out. Because each layer inherits the maps of earlier layers and uses sparse matrix multiplication, PFA skips unnecessary similarity calculations and can afford larger interaction windows; the authors report state-of-the-art PSNR/SSIM at lower FLOPs. Their ablations show gains from progressive attention, from suitably chosen focus ratios, and from larger window sizes, and the method compares favorably with recent SR methods on standard benchmarks. The approach may offer practical efficiency gains for high-resolution image restoration and could extend to other vision tasks that benefit from broad contextual interactions.

Abstract

Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.

Paper Structure

This paper contains 11 sections, 8 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Comparison of different attention mechanisms. (a) Vanilla self-attention calculates attention weights in the whole window and generates non-zero weights for both highly relevant and less relevant tokens. (b) Sparse Attention is able to filter out the impact of less relevant tokens with small weights, but still requires the calculation of attention weights for all tokens. (c) Progressive Focused Attention connects isolated attention maps, leveraging attention weights to skip unnecessary computations and better aggregate relevant tokens.
  • Figure 2: The overall architecture of PFT. PFA Block consists of $M$ Progressive Focused Attention Layers (PFAL). Each PFA takes both image features and aggregated PFA maps from previous layers as input. Sparse Matrix Multiplication (SMM) ensures each row of $Q$ interacts only with sparse columns of $K^T$, producing calculated attention maps. After applying the Hadamard product and sparse focusing, PFA maps of the current layer are obtained and used with the $V$ matrix in an SMM operation to generate attention-aggregated features.
  • Figure 3: Visual comparison of attention distributions in the 18th layer. SA and top-$k$ Attention distribute attention broadly, failing to focus on the most relevant areas. In contrast, PFA filters out irrelevant tokens and concentrates attention on key regions. By reducing computational costs, it enables the use of a larger 32$\times$32 window for more extensive feature interactions.
  • Figure 4: Visual comparison of SR reconstruction results.
  • Figure 5: The visualization of attention distributions across different layers of the PFT-light model demonstrates the progressive filtering capability of the PFA module.
  • ...and 3 more figures
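
The data flow described in the Figure 2 caption (similarities computed only at inherited positions, a Hadamard product with the inherited map, sparse focusing, then aggregation with $V$) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `masked_softmax` and `pfa_layer`, the `keep` parameter (tokens retained per query), the top-`keep` focusing rule, and the row renormalization are all assumptions, and a dense boolean mask stands in for true sparse matrix multiplication, so the sketch shows the logic but not the FLOP savings.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax over positions where mask is True; zero elsewhere."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def pfa_layer(q, k, v, prev_map, keep):
    """One progressive focused attention step (illustrative sketch).

    q, k, v:  (n, d) arrays for one attention window.
    prev_map: (n, n) non-negative map inherited from the previous layer,
              or None for the first layer.
    keep:     number of tokens each query retains after focusing (assumed rule).
    """
    n, d = q.shape
    # Positions that survived earlier layers; every position in the first layer.
    active = np.ones((n, n), dtype=bool) if prev_map is None else prev_map > 0
    # Dense stand-in for SMM: a sparse kernel would only compute active entries.
    scores = (q @ k.T) / np.sqrt(d)
    attn = masked_softmax(scores, active)
    if prev_map is not None:
        # Hadamard product with the inherited map, then renormalize (assumption).
        attn = attn * prev_map
        attn = attn / attn.sum(axis=-1, keepdims=True)
    # Sparse focusing: keep the `keep` largest weights in each row.
    idx = np.argpartition(attn, -keep, axis=-1)[:, -keep:]
    focused = np.zeros_like(attn)
    np.put_along_axis(focused, idx, np.take_along_axis(attn, idx, axis=-1), axis=-1)
    focused = focused / focused.sum(axis=-1, keepdims=True)
    # Second SMM stand-in: aggregate values with the focused map.
    return focused @ v, focused

# Toy run: 16 tokens, 8 channels, three layers with a shrinking budget.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
attn_map, maps = None, []
for keep in (8, 4, 2):
    # Hypothetical random projections in place of learned weights.
    w_q, w_k, w_v = (rng.standard_normal((8, 8)) / np.sqrt(8) for _ in range(3))
    x, attn_map = pfa_layer(x @ w_q, x @ w_k, x @ w_v, attn_map, keep)
    maps.append(attn_map)
```

In this toy run each layer's non-zero positions are a subset of the previous layer's, which is the progressive narrowing the captions of Figures 1(c) and 5 describe. A real implementation would gather only the retained key and value indices so that the skipped similarities are never computed.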