Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe

TL;DR

This work tackles the inefficiency of global self-attention in Vision Transformers for image restoration by introducing SemanIR, which builds a per-stage Key-Semantic Dictionary that links each degraded patch to its top-$k$ semantically related patches via KNN. The dictionary is shared across all transformer layers within a stage, restricting attention to semantically relevant patches and yielding near-linear complexity within each window. Across six IR tasks, SemanIR achieves state-of-the-art performance and favorable efficiency, supported by extensive ablations on the top-$k$ parameter and attention implementations (Triton, Torch-Gather, Torch-Mask). The approach demonstrates the value of stage-wide semantic sharing for restoring degraded images and offers practical insights into efficient ViT-based IR deployments, with code and models available at the authors' GitHub page.

Abstract

Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. During computation, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information. Additionally, for IR, it is commonly noted that small segments of a degraded image, especially those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR's state-of-the-art performance, quantitatively and qualitatively showcasing advancements. The visual results, code, and trained models are available at https://github.com/Amazingren/SemanIR.
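
The idea in the abstract (build a top-k dictionary once per stage, then reuse it in every block of that stage) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the use of cosine similarity as the semantic measure, the single attention head, and the plain residual connection are all assumptions made for illustration. The gather-based lookup is loosely analogous in spirit to the "Torch-Gather" variant mentioned in the TL;DR.

```python
import numpy as np

def build_key_semantic_dict(x, k):
    """Hypothetical helper: for each of the N tokens in x (shape (N, d)),
    return the indices of its k most similar tokens (shape (N, k)).
    Cosine similarity is an assumption, not taken from the paper."""
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = xn @ xn.T                       # (N, N) pairwise similarity
    # argpartition places the k largest similarities in the first k columns
    return np.argpartition(-sim, k - 1, axis=1)[:, :k]

def sparse_attention(x, wq, wk, wv, idx):
    """Single-head attention where token n attends only to the k tokens idx[n]."""
    q, key, v = x @ wq, x @ wk, x @ wv
    k_sel = key[idx]                      # (N, k, d) gathered keys
    v_sel = v[idx]                        # (N, k, d) gathered values
    scores = np.einsum("nd,nkd->nk", q, k_sel) / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", w, v_sel)

def run_stage(x, layers, k):
    """Build the dictionary once, then share it across every layer in the stage."""
    idx = build_key_semantic_dict(x, k)
    for wq, wk, wv in layers:
        x = x + sparse_attention(x, wq, wk, wv, idx)       # residual update
    return x, idx

# Usage: 16 tokens of dimension 8, three layers sharing one dictionary with k = 3
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
layers = [tuple(rng.standard_normal((8, 8)) * 0.1 for _ in range(3)) for _ in range(3)]
out, shared_idx = run_stage(tokens, layers, k=3)
```

In this sketch the score computation in each layer touches N·k query-key pairs instead of N·N, which is where the per-layer saving comes from; the N·N similarity matrix is computed only once per stage, when the dictionary is built.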

Paper Structure (20 sections, 7 equations, 19 figures, 11 tables, 1 algorithm)

This paper contains 20 sections, 7 equations, 19 figures, 11 tables, 1 algorithm.

Figures (19)

  • Figure 1: (a) The CNN filter captures information only within a local region. (b) The standard MLP/Transformer architectures take full input in a long-sequence manner. (c) The window-size multi-head self-attention (MSA) mechanism builds a full connection within each window. (d) Position-fixed sparse connection. (e) The proposed Key-Semantic connection.
  • Figure 2: (a) The proposed SemanIR mainly consists of a convolutional feature extractor, the main body of SemanIR for representation learning, and an image reconstructor. The main body in columnar shape shown here is for image SR, while the U-shaped structure (shown in the appendix) is used for other IR tasks. (b) The transformer layer of our SemanIR. The toy example of $k$=3 for (c) the Key-semantic dictionary construction and (d) the attention of each Layer.
  • Figure 3: The impact of $k$ with different inference $k$ values. Circle size represents FLOPs.
  • Figure 4: The impact of $k$ with different inference $k$ values.
  • Figure 5: One model is trained to handle multi-degradations for denoising (a-b) and JPEG CAR (c-d).
  • ...and 14 more figures