Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

Leheng Zhang; Yawei Li; Xingyu Zhou; Xiaorui Zhao; Shuhang Gu

TL;DR

This work addresses the limited receptive field of window-based super-resolution (SR) transformers by introducing an Adaptive Token Dictionary (ATD), which injects external priors through a learnable token dictionary. The core components are Token Dictionary Cross-Attention (TDCA), Adaptive Dictionary Refinement (ADR), and Adaptive Category-based Multi-head Self-Attention (AC-MSA). Together they provide global feature augmentation and non-local, category-aware self-attention. On standard SR benchmarks the method reports state-of-the-art PSNR/SSIM at competitive model size and FLOPs, and ablations attribute the gains to TDCA, ADR, and AC-MSA. The results suggest that combining dictionary-inspired priors with adaptive, content-aware attention improves SR quality and generalization, and offers a scalable way to model long-range dependencies in vision transformers.

Abstract

Single image super-resolution (SR) is a classic computer vision problem that involves estimating high-resolution (HR) images from low-resolution (LR) ones. Although deep neural networks (DNNs), especially Transformers for super-resolution, have advanced significantly in recent years, challenges remain, in particular the limited receptive field caused by window-based self-attention. To address this issue, we introduce an auxiliary Adaptive Token Dictionary into the SR Transformer and establish the ATD-SR method. The token dictionary learns prior information from training data and adapts the learned prior to a specific test image through an adaptive refinement step. The refinement strategy not only provides global information to all input tokens but also groups image tokens into categories. Based on these category partitions, we further propose a category-based self-attention mechanism that leverages distant but similar tokens to enhance input features. Experimental results show that our method achieves the best performance on various single image super-resolution benchmarks.
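The cross-attention idea described in the abstract can be illustrated with a minimal sketch: image tokens act as queries, and the dictionary entries act as keys and values. The sketch assumes plain scaled dot-product similarity and omits all learned projections; the paper's exact similarity measure and parameterization may differ. The function name `dictionary_cross_attention`, the toy inputs, and the use of an argmax over the attention map to obtain category labels are assumptions made for illustration. Plain Python lists are used to keep the example dependency-free.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dictionary_cross_attention(tokens, dictionary):
    """Attend from N image tokens (N x d) to M dictionary tokens (M x d).

    Assumption: queries are the tokens themselves and keys/values are the
    dictionary entries themselves (no learned projections).
    """
    d = len(dictionary[0])
    scale = 1.0 / math.sqrt(d)
    # Attention map: one row per image token, one column per dictionary entry.
    attn = [
        softmax([scale * sum(q * k for q, k in zip(tok, entry)) for entry in dictionary])
        for tok in tokens
    ]
    # Each output token is a weighted sum of dictionary entries.
    out = [
        [sum(a * entry[c] for a, entry in zip(row, dictionary)) for c in range(d)]
        for row in attn
    ]
    # Assumed categorization rule: the most similar dictionary entry per token.
    labels = [max(range(len(row)), key=row.__getitem__) for row in attn]
    return out, attn, labels


# Toy data: three 2-D image tokens and a two-entry dictionary.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
dictionary = [[2.0, 0.0], [0.0, 2.0]]
out, attn, labels = dictionary_cross_attention(tokens, dictionary)
print(labels)  # [0, 1, 0]
```

Because every token attends to the same small dictionary, the cost grows with N x M rather than N x N, which is one way to read the abstract's claim that the dictionary supplies global information to all tokens.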

Paper Structure (25 sections, 8 equations, 9 figures, 6 tables)

Introduction
Related Works
Methodology
Motivation
Token Dictionary Cross-Attention
Adaptive Dictionary Refinement
Adaptive Category-based Attention
The Overall Network Architecture
Experiments
Experimental Settings
Ablation Study
Effects of TDCA, ADR, and AC-MSA.
Effects of different designs of category-based attention.
Effects of sub-category size $n_s$.
Effects of token dictionary size $M$.
...and 10 more sections

Figures (9)

Figure 1: Three different kinds of attention mechanism: (a) window-based self-attention exploits tokens in the same local window to enhance image tokens; (b) our proposed token dictionary cross-attention leverages the auxiliary dictionary to summarize and incorporate global information to the image tokens; (c) our proposed category-based self-attention adopts category labels to divide image tokens.
Figure 2: The proposed (a) Token Dictionary Cross-Attention (TDCA) and (b) Adaptive Category-based Multi-head Self-Attention (AC-MSA). In (b), we omit the details of dividing categories $\theta$ into sub-categories $\phi$ for simplicity and better understanding. More details of TDCA and AC-MSA can be found in the sections "Token Dictionary Cross-Attention" and "Adaptive Category-based Attention".
Figure 3: The overall architecture of the proposed ATD network. Token dictionary cross-attention (Figure 2a), adaptive category-based MSA (Figure 2b), and window-based MSA (Swin Transformer; Liu et al., 2021) form the main structure of the transformer layer. Each ATD block contains several transformer layers and an initial token dictionary $\bm{D}^{(1)}$. The token dictionary is recurrently adapted via the adaptive dictionary refinement operation.
Figure 4: Visual comparisons of ATD and other state-of-the-art image super-resolution methods.
Figure 5: Visualization of categorization results of adaptive category-based MSA. (a) is the input image. In each binarized image (b)-(e), the white region marks the tokens assigned to a single attention category.
...and 4 more figures
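The category-based self-attention in Figure 1(c) and Figure 2(b) can be sketched as follows: tokens are ordered by category label, split into sub-categories of bounded size, and attention is computed only within each sub-category. This is a rough illustration under stated assumptions, not the authors' implementation. The function name `category_attention`, the sort-then-chunk grouping rule, the identity projections, and the toy inputs are all hypothetical; `n_s` follows the outline's notation for sub-category size.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def category_attention(tokens, labels, n_s):
    """Self-attention restricted to sub-categories of at most n_s tokens.

    Assumption: tokens are sorted by category label and the sorted order is
    cut into consecutive chunks of size n_s, so a chunk can hold spatially
    distant tokens that share a label.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    # Stable sort keeps the original order among tokens with the same label.
    order = sorted(range(len(tokens)), key=lambda i: labels[i])
    groups = [order[s:s + n_s] for s in range(0, len(order), n_s)]
    out = [None] * len(tokens)
    for group in groups:
        for i in group:
            # Similarity of token i to every token in its own group only.
            scores = [
                scale * sum(a * b for a, b in zip(tokens[i], tokens[j]))
                for j in group
            ]
            weights = softmax(scores)
            out[i] = [
                sum(w * tokens[j][c] for w, j in zip(weights, group))
                for c in range(d)
            ]
    return out, groups


# Toy data: tokens 0 and 2 share category 0, tokens 1 and 3 share category 1.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
labels = [0, 1, 0, 1]
out, groups = category_attention(tokens, labels, n_s=2)
print(groups)  # [[0, 2], [1, 3]]
```

In contrast to window-based attention, membership in a group here depends on the label rather than on position, so tokens 0 and 2 attend to each other even though token 1 sits between them.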
