DSwinIR: Rethinking Window-based Attention for Image Restoration

Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie

TL;DR

This work tackles two limitations of window-based attention in image restoration: context truncation at window boundaries and fixed receptive fields. It introduces Deformable Sliding Window (DSwin) Attention, a token-centric, content-adaptive mechanism that combines sliding windows with learnable offsets so that the sampling pattern follows image content. DSwin Attention is embedded in the DSwinIR backbone, together with multi-scale DSwin attention and a multi-scale gated FFN. Across all-in-one and single-task benchmarks, DSwinIR achieves competitive or state-of-the-art results, surpassing many backbone and prompt-based methods on PSNR/SSIM while remaining efficient. The findings suggest that a flexible, content-aware backbone can robustly handle diverse degradations and complex real-world artifacts, with potential to generalize to other dense prediction tasks.
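The boundary truncation mentioned above can be made concrete with a small one-dimensional count. The sketch below is illustrative only: the helper names `partition_reach` and `centered_reach` are hypothetical, image borders are ignored, and nothing here is taken from the authors' code. It reports how many tokens a query can attend to on each side under non-overlapping windows versus a window centered on the token.

```python
def partition_reach(pos, window):
    """Reach (left, right) of the token at 1-D index `pos` when the sequence
    is split into non-overlapping windows of length `window`."""
    start = (pos // window) * window        # first index of the token's window
    return pos - start, start + window - 1 - pos


def centered_reach(window):
    """Reach (left, right) of any token when the window is centered on it
    (token-centric sliding window; image borders are ignored here)."""
    radius = window // 2
    return radius, radius


if __name__ == "__main__":
    for pos in (0, 3, 7, 8):
        print(pos, partition_reach(pos, window=8), centered_reach(window=7))
```

Under partitioning, a token on a window edge sees nothing on one side, while under the token-centric scheme every token has the same symmetric reach.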

Abstract

Image restoration has witnessed significant advancements with the development of deep learning models. Transformer-based models, particularly those using window-based self-attention, have become a dominant force. However, their performance is constrained by the rigid, non-overlapping window partitioning scheme, which leads to insufficient feature interaction across windows and limited receptive fields. This highlights the need for more adaptive and flexible attention mechanisms. In this paper, we propose the Deformable Sliding Window Transformer for Image Restoration (DSwinIR), built on a new attention mechanism, Deformable Sliding Window (DSwin) Attention. This mechanism introduces a token-centric and content-aware paradigm that moves beyond the rigid grid and fixed window partitioning. It comprises two complementary components. First, it replaces the rigid partitioning with a token-centric sliding window paradigm, making it effective at eliminating boundary artifacts. Second, it incorporates a content-aware deformable sampling strategy, which allows the attention mechanism to learn data-dependent offsets and actively shape its receptive field to focus on the most informative image regions. Extensive experiments show that DSwinIR achieves strong results, including state-of-the-art performance on several evaluated benchmarks. For instance, in all-in-one image restoration, DSwinIR surpasses the most recent backbone, GridFormer, by 0.53 dB on the three-task benchmark and 0.87 dB on the five-task benchmark.
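To illustrate the two components described in the abstract, here is a minimal single-head NumPy sketch for one query token. It is not the authors' implementation. The function names, the bilinear sampler, the clamping of out-of-range coordinates, and the `offsets` argument (a fixed array standing in for offsets that the model would predict from content) are all assumptions made for illustration.

```python
import numpy as np


def bilinear_sample(feat, ys, xs):
    """Sample feat (H, W, C) at fractional coordinates ys, xs (each shape (K,)).
    Coordinates are clamped to the image; this border handling is an assumption."""
    H, W, _ = feat.shape
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]                 # vertical interpolation weight
    wx = (xs - x0)[:, None]                 # horizontal interpolation weight
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom     # (K, C)


def dswin_attention_token(q, k, v, i, j, offsets, window=3):
    """Attention output for the query token at (i, j).

    q, k, v : (H, W, C) query, key, and value maps.
    offsets : (window * window, 2) array of (dy, dx) shifts; a hypothetical
              stand-in for the learned, content-dependent offsets.
    """
    r = window // 2
    dy, dx = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    # Token-centric window: the reference grid is centered on (i, j),
    # then each reference point is shifted by its offset.
    ys = i + dy.ravel() + offsets[:, 0]
    xs = j + dx.ravel() + offsets[:, 1]
    k_s = bilinear_sample(k, ys, xs)        # sampled keys, (K, C)
    v_s = bilinear_sample(v, ys, xs)        # sampled values, (K, C)
    logits = k_s @ q[i, j] / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                # softmax over the K sampled points
    return weights @ v_s                    # (C,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(8, 8, 4)) for _ in range(3))
    offsets = rng.uniform(-1.0, 1.0, size=(9, 2))
    print(dswin_attention_token(q, k, v, i=0, j=0, offsets=offsets).shape)
```

With all offsets set to zero, the sketch reduces to plain sliding-window attention over a window centered on the query. Non-zero offsets move the sampling points off the regular grid, which is why fractional (bilinear) sampling is needed.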

Paper Structure

This paper contains 46 sections, 5 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Comparative analysis of feature extraction mechanisms with an anchor token (marked by $\star$) as the reference point. (a) Vanilla convolution applies a fixed sampling pattern, leveraging neighborhood features. (b) Deformable convolution introduces adaptive sampling locations based on content, enabling more effective feature integration from relevant regions. (c) Window attention suffers from boundary constraints: anchor tokens near window edges (especially corners) have limited receptive fields. (d) The proposed Deformable Sliding Window (DSwin) attention extends window attention with a token-centric paradigm and content-aware sampling, providing robust feature aggregation for anchor tokens.
  • Figure 2: Quantitative comparison of the proposed DSwinIR against existing methods across diverse image restoration tasks, showing consistently superior performance. All metrics are reported in PSNR (dB).
  • Figure 3: Overview of the proposed DSwinIR architecture. The model is built upon a U-shaped backbone where the core component is the DSwin Transformer Block (DSTB). The key modules include: (a) Deformable Sliding Window Attention (DSwin), which adaptively samples features by learning content-dependent offsets; (b) Multi-Scale Deformable Sliding Window (MS-DSwin) Attention, which integrates multi-scale DSwin attention across multiple attention heads; and (c) Multi-Scale Gated Feed-Forward Network (MSG-FFN), which leverages parallel convolutional branches to enhance feature representation.
  • Figure 4: Visualization of the content-aware sampling mechanism in DSwinIR. The figure presents a qualitative analysis of our proposed DSwin Attention. (a) The input image, with a red dot indicating the anchor location for local analysis. (b) The corresponding local sampling pattern, showing the reference grid, the adaptively sampled points, and the learned offsets. (c) The Deformation Magnitude Map, which visualizes the offset distance ($d = \sqrt{dx^2 + dy^2}$) as a heatmap, where brighter intensity signifies a larger offset.
  • Figure 5: Visual comparison of restoration results across three degradation tasks: noise removal (top row), rain streak removal (middle row), and dehazing (bottom row). Zoom-in regions (shown in colored boxes) demonstrate that our method achieves superior detail preservation and degradation removal.
  • ...and 5 more figures
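
The Deformation Magnitude Map in Figure 4(c) uses the offset distance $d = \sqrt{dx^2 + dy^2}$. A minimal sketch of that computation follows. The function name `deformation_magnitude` and the `(dx, dy)` ordering of the last axis are assumptions for illustration, not details from the paper.

```python
import numpy as np


def deformation_magnitude(offsets):
    """Euclidean length of each offset vector.

    offsets : array of shape (..., 2) whose last axis is assumed to hold (dx, dy).
    Returns an array of shape (...) with d = sqrt(dx^2 + dy^2).
    """
    dx = offsets[..., 0]
    dy = offsets[..., 1]
    return np.sqrt(dx ** 2 + dy ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical per-pixel offsets for a 4 x 4 map.
    offsets = rng.normal(size=(4, 4, 2))
    print(deformation_magnitude(offsets).shape)
```

Rendering the resulting array as a heatmap gives a map in which brighter values correspond to larger offsets, matching the caption's description.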