ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration

Leheng Zhang, Wei Long, Yawei Li, Xingyu Zhou, Xiaorui Zhao, Shuhang Gu

TL;DR

This paper proposes Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. ATD and its lightweight version, ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks, while ATD-U, a multi-scale variant of ATD, addresses image denoising and JPEG compression artifact removal.

Abstract

Recently, Transformers have gained significant popularity in image restoration tasks such as image super-resolution and denoising, owing to their superior performance. However, balancing performance and computational burden remains a long-standing problem for transformer-based architectures. Due to the quadratic complexity of self-attention, existing methods often restrict attention to local windows, resulting in a limited receptive field and suboptimal performance. To address this issue, we propose Adaptive Token Dictionary (ATD), a novel transformer-based architecture for image restoration that enables global dependency modeling with linear complexity relative to image size. The ATD model incorporates a learnable token dictionary, which summarizes external image priors (i.e., typical image structures) during the training process. To utilize this information, we introduce a token dictionary cross-attention (TDCA) mechanism that enhances the input features via interaction with the learned dictionary. Furthermore, we exploit the category information embedded in the TDCA attention maps to group input features into multiple categories, each representing a cluster of similar features across the image and serving as an attention group. We also integrate the learned category information into the feed-forward network to further improve feature fusion. ATD and its lightweight version, ATD-light, achieve state-of-the-art performance on multiple image super-resolution benchmarks. Moreover, we develop ATD-U, a multi-scale variant of ATD, to address other image restoration tasks, including image denoising and JPEG compression artifact removal. Extensive experiments demonstrate the superiority of our proposed models, both quantitatively and qualitatively.
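
To make the linear-complexity claim concrete, the following is a minimal sketch of cross-attention between N image tokens and a dictionary of M entries. It is not the authors' implementation: the function names, the scaled dot-product similarity, the omission of learned query/key/value projections, and the toy values are all assumptions made for illustration. Each token attends only to the M dictionary entries, so the cost grows as O(N·M·d) rather than the O(N²·d) of full self-attention, which is linear in N when M is fixed.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dictionary_cross_attention(features, dictionary):
    """Attend from N feature tokens (N x d) to M dictionary entries (M x d).

    Assumption: plain scaled dot-product similarity and no learned
    projections; a trained model would normally include both.
    Returns the enhanced features (N x d) and the attention map (N x M).
    """
    d = len(dictionary[0])
    scale = 1.0 / math.sqrt(d)
    attn = []
    for token in features:  # N iterations, each costing O(M * d)
        scores = [
            scale * sum(a * b for a, b in zip(token, entry))
            for entry in dictionary
        ]
        attn.append(softmax(scores))
    # Each output token is a weighted sum of dictionary entries.
    enhanced = [
        [sum(w * entry[c] for w, entry in zip(row, dictionary)) for c in range(d)]
        for row in attn
    ]
    return enhanced, attn


# Toy example (hypothetical values): 4 tokens, 2 dictionary entries, d = 2.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
dictionary = [[4.0, 0.0], [0.0, 4.0]]
enhanced, attn = dictionary_cross_attention(features, dictionary)
```

The attention map returned here has one row per token and one column per dictionary entry; it is this N x M map that the abstract describes as carrying category information.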

Paper Structure (24 sections, 16 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Visual comparison of different attention mechanisms. (a) Window-based self-attention constrains attention to local windows because of its quadratic computational complexity, resulting in a limited receptive field. (b) Our proposed token dictionary cross-attention leverages the typical image structures summarized in the token dictionary to incorporate external information into the input features. (c) Our proposed adaptive category-based self-attention exploits relationships within categories, connecting distant but similar tokens across the image.
  • Figure 2: Overall architectures of the proposed ATD and ATD-U networks. Each ATD block contains several consecutive transformer layers and a learnable token dictionary. Each transformer layer combines three attention branches to enhance image features: token dictionary cross-attention (TDCA), adaptive category-based multi-head self-attention (AC-MSA), and window-based self-attention. The attention map of the TDCA branch is further used by the AC-MSA branch for categorization and by the CFFN for dictionary entry selection.
  • Figure 3: Illustrations of the proposed (a) Token Dictionary Cross-Attention (TDCA), (b) Adaptive Category-based Multi-head Self-Attention (AC-MSA), and (c) Category-aware Feed-Forward Network (CFFN). TDCA and AC-MSA are described in their respective sections of the paper. Further details of the $\operatorname{Categorize}$ operation can be found in its defining equation and in Fig. 5.
  • Figure 4: Distribution of the maximum attention values over image tokens with respect to the token dictionary in TDCA across different layers. The y-axis is log-scaled for better visualization.
  • Figure 5: Illustration of the proposed $\operatorname{Categorize}$ operation. The attention map is flattened and sorted based on the index of the maximum value in each row, as defined by the categorization and sub-categorization equations. This effectively clusters tokens sharing similar features for self-attention. Subsequently, the $\operatorname{UnCategorize}$ operation applies the inverse permutation to restore the original spatial structure.
  • ...and 5 more figures
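
The sorting and inverse-permutation steps described in the Figure 5 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: the function names `categorize` and `uncategorize`, the toy attention map, and the token labels are hypothetical, and the sketch covers only the argmax-based sorting and its inverse, not the sub-categorization step.

```python
def categorize(attn):
    """Assign each token to the dictionary entry with the largest attention
    value, then return a permutation that groups tokens by that category.

    attn: N x M attention map (one row per token).
    Returns (order, categories), where order[pos] is the original index of
    the token placed at sorted position pos.
    """
    categories = [max(range(len(row)), key=row.__getitem__) for row in attn]
    # sorted() is stable, so tokens within a category keep their spatial order.
    order = sorted(range(len(attn)), key=lambda i: categories[i])
    return order, categories


def uncategorize(grouped_items, order):
    """Apply the inverse permutation to restore the original token order."""
    restored = [None] * len(order)
    for pos, original_index in enumerate(order):
        restored[original_index] = grouped_items[pos]
    return restored


# Toy attention map (hypothetical): 4 tokens, 2 dictionary entries.
attn = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
tokens = ["t0", "t1", "t2", "t3"]

order, categories = categorize(attn)
grouped = [tokens[i] for i in order]    # tokens of the same category are adjacent
restored = uncategorize(grouped, order)  # back to the original layout
```

In this toy case, tokens 1 and 3 prefer entry 0 and tokens 0 and 2 prefer entry 1, so grouping places similar tokens next to each other regardless of their position in the image; self-attention would then be applied within each group before the inverse permutation is applied.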