Efficient Concertormer for Image Deblurring and Beyond

Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang

TL;DR

Concertormer introduces Concerto Self-Attention (CSA) to capture both global and local dependencies with linear complexity, addressing the quadratic cost of standard self-attention in high-resolution image restoration. It decomposes attention into Concertino (global) and Ripieno (local) components, adds a Cross-Dimensional Communication (CDC) module, and replaces the traditional FFN with a single-stage gated-dconv MLP (gdMLP). With this design it achieves favorable PSNR/SSIM with reduced FLOPs across deblurring, JPEG-artifact restoration, and deraining. Extensive experiments on GoPro, HIDE, RealBlur, and related tasks demonstrate state-of-the-art performance and efficiency, with ablations validating the contributions of CSA, CDC, and gdMLP. The framework further shows potential for other image restoration problems, offering a practical, scalable Transformer-based solution for high-resolution vision tasks.

Abstract

The Transformer architecture has achieved remarkable success in natural language processing and high-level vision tasks over the past few years. However, the complexity of self-attention is quadratic in the image size, leading to unaffordable computational costs for high-resolution vision tasks. In this paper, we introduce Concertormer, featuring a novel Concerto Self-Attention (CSA) mechanism designed for image deblurring. The proposed CSA divides self-attention into two distinct components: one emphasizes general, global correspondence and the other concentrates on specific, local correspondence. By retaining partial information in additional dimensions independent of the self-attention calculations, our method effectively captures global contextual representations with complexity linear in the image size. To leverage the additional dimensions, we present a Cross-Dimensional Communication module, which linearly combines attention maps and thus enhances expressiveness. Moreover, we merge the two-stage Transformer design into a single stage using the proposed gated-dconv MLP architecture. While our primary objective is single-image motion deblurring, extensive quantitative and qualitative evaluations demonstrate that our approach performs favorably against state-of-the-art methods in other tasks, such as deraining and deblurring with JPEG artifacts. The source code and trained models will be made available to the public.
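
To make the quadratic-versus-linear distinction in the abstract concrete, the sketch below counts attention-map entries for full self-attention over all pixels and for non-overlapping window attention. Window attention is used here only as a generic example of a scheme whose cost grows linearly with image size; it is not the CSA mechanism itself, and the function names and the window size of 8 are illustrative assumptions.

```python
def global_attention_entries(h, w):
    """Entries in a full self-attention map over all h*w pixels."""
    n = h * w
    return n * n


def window_attention_entries(h, w, k):
    """Total entries across non-overlapping k x k windows.

    Assumes k divides both h and w (an assumption made for simplicity).
    """
    assert h % k == 0 and w % k == 0
    num_windows = (h * w) // (k * k)
    return num_windows * (k * k) ** 2


# Doubling the side length quadruples the pixel count.
for side in (64, 128, 256):
    full = global_attention_entries(side, side)
    windowed = window_attention_entries(side, side, 8)
    print(side, full, windowed, full // windowed)
```

Each doubling of the side length multiplies the full attention map by 16 but the windowed total by only 4, which is the gap that motivates linear-complexity designs for high-resolution inputs.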

Paper Structure (19 sections, 12 equations, 20 figures, 8 tables)

Figures (20)

  • Figure 1: PSNR vs FLOPs on the HIDE [shen2019human] dataset. Our models are marked in blue and magenta text. The area of circles is proportional to the number of parameters.
  • Figure 2: Self-attention methods. (a) Input tensor. (b) Typical self-attention. (c) Window multi-head self-attention [liu2021swin, liu2022swin], where $n=hw/k^2$ is the number of blocks. (d) Transposed self-attention [zamir2022restormer] and a random column permutation. Column permutations do not affect the resulting attention map.
  • Figure 3: Network architecture. The overall network is shown on the left, and the sub-modules are on the right. XA: cross-attention, SA: self-attention. We use XA for the first blocks of $L_2$ through $L_7$, and SA for the remaining blocks.
  • Figure 4: Attention maps. (a) Typical self-attention. The lighter color represents a smaller attention value. (b) Window multi-head self-attention. (c) Concerto Self-Attention. Each block on the diagonal shares the same Concertino and has its own Ripieno component.
  • Figure 5: Single-image motion deblurring on the GoPro [nah2017deep] dataset.
  • ...and 15 more figures
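
The Figure 2(d) remark that column permutations do not affect the transposed attention map can be checked numerically. The sketch below assumes the usual layout for transposed (channel-wise) attention, with queries and keys stored as $c \times hw$ matrices and logits given by $QK^\top$. Applying the same permutation to the columns (spatial positions) of both matrices leaves the $c \times c$ product unchanged, so any softmax applied afterwards is unchanged too. The helper names and the small sizes are illustrative, and integer entries are used so the comparison is exact.

```python
import random


def channel_gram(q, k):
    """Return the c x c matrix Q K^T for Q, K given as c rows of length hw."""
    return [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
            for q_row in q]


def permute_columns(m, perm):
    """Reorder the columns (spatial positions) of every row by perm."""
    return [[row[j] for j in perm] for row in m]


random.seed(0)
c, hw = 3, 8
# Integer entries keep the sums exact regardless of summation order.
q = [[random.randint(-5, 5) for _ in range(hw)] for _ in range(c)]
k = [[random.randint(-5, 5) for _ in range(hw)] for _ in range(c)]

perm = list(range(hw))
random.shuffle(perm)

original = channel_gram(q, k)
shuffled = channel_gram(permute_columns(q, perm), permute_columns(k, perm))
print(original == shuffled)  # True
```

The equality holds because each entry of $QK^\top$ is a sum over spatial positions, and reordering the terms of a sum does not change its value. The spatial arrangement of pixels therefore plays no role in the transposed attention map.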