DSwinIR: Rethinking Window-based Attention for Image Restoration
Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie
TL;DR
This work tackles the limitations of window-based attention in image restoration, notably boundary context truncation and fixed receptive fields. It introduces Deformable Sliding Window (DSwin) Attention, a token-centric and content-adaptive mechanism that combines sliding windows with learnable offsets to adapt the sampling pattern to image content, and embeds it in the DSwinIR backbone with multi-scale DSwin attention and a multi-scale gated FFN. Across all-in-one and single-task benchmarks, DSwinIR achieves competitive or state-of-the-art results, surpassing many backbone and prompt-based methods on PSNR/SSIM while maintaining efficiency. The findings suggest that a flexible, content-aware backbone can robustly handle diverse degradations and complex real-world artifacts, with potential to generalize to other dense prediction tasks.
Abstract
Image restoration has witnessed significant advancements with the development of deep learning models. Transformer-based models, particularly those using window-based self-attention, have become a dominant force. However, their performance is constrained by the rigid, non-overlapping window partitioning scheme, which leads to insufficient feature interaction across windows and limited receptive fields. This highlights the need for more adaptive and flexible attention mechanisms. In this paper, we propose the Deformable Sliding Window Transformer for Image Restoration (DSwinIR), built on a new attention mechanism: Deformable Sliding Window (DSwin) Attention. This mechanism introduces a token-centric and content-aware paradigm that moves beyond the fixed grid of window partitions. It comprises two complementary components. First, it replaces the rigid partitioning with a token-centric sliding window paradigm, which is effective at eliminating boundary artifacts. Second, it incorporates a content-aware deformable sampling strategy, which allows the attention mechanism to learn data-dependent offsets and actively shape its receptive field to focus on the most informative image regions. Extensive experiments show that DSwinIR achieves strong results, including state-of-the-art performance on several evaluated benchmarks. For instance, in all-in-one image restoration, our DSwinIR surpasses the most recent backbone GridFormer by 0.53 dB on the three-task benchmark and 0.87 dB on the five-task benchmark.
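The two components described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `bilinear_sample` and `dswin_attention_sketch` are hypothetical, the sketch is single-head, the offsets are passed in as an array rather than predicted by a learned network, and out-of-range sampling positions are clamped to the image border, all of which are assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample feat (H, W, C) at fractional coordinates ys, xs of equal shape.

    Assumption: coordinates outside the image are clamped to the border.
    """
    H, W, _ = feat.shape
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[..., None]  # vertical interpolation weight
    wx = (xs - x0)[..., None]  # horizontal interpolation weight
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def dswin_attention_sketch(q, k, v, offsets, window=3):
    """Token-centric sliding-window attention with per-token sampling offsets.

    q, k, v: arrays of shape (H, W, C).
    offsets: array of shape (H, W, window * window, 2) holding (dy, dx);
             here an input, standing in for offsets a network would learn.
    """
    H, W, C = q.shape
    r = window // 2
    # Regular window grid centered on each token (the sliding-window part).
    dy, dx = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    base = np.stack([dy.ravel(), dx.ravel()], axis=-1)  # (K, 2)
    cy, cx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Shift every sampling position by its offset (the deformable part).
    ys = cy[..., None] + base[:, 0] + offsets[..., 0]  # (H, W, K)
    xs = cx[..., None] + base[:, 1] + offsets[..., 1]  # (H, W, K)
    k_s = bilinear_sample(k, ys, xs)  # (H, W, K, C)
    v_s = bilinear_sample(v, ys, xs)  # (H, W, K, C)
    # Each query attends only to its own K sampled keys.
    logits = np.einsum("hwc,hwkc->hwk", q, k_s) / np.sqrt(C)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return np.einsum("hwk,hwkc->hwc", attn, v_s)
```

With all offsets set to zero, the sketch reduces to plain sliding-window attention over a `window` by `window` neighborhood centered on each token, so no token sits at the edge of a fixed partition. Non-zero offsets move the sampled positions away from that regular grid, which is how the receptive field can adapt to content.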
