Efficient Concertormer for Image Deblurring and Beyond
Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang
TL;DR
Concertormer introduces Concerto Self-Attention (CSA), which captures both global and local dependencies with complexity linear in the image size, addressing the quadratic-cost bottleneck of standard self-attention in high-resolution image restoration. CSA decomposes attention into Concertino (global) and Ripieno (local) components and adds a Cross-Dimensional Communication (CDC) module; a single-stage gated-dconv MLP (gdMLP) replaces the traditional FFN. Together these achieve favorable PSNR/SSIM with reduced FLOPs across motion deblurring, deblurring with JPEG artifacts, and deraining. Extensive experiments on GoPro, HIDE, RealBlur, and related tasks demonstrate state-of-the-art performance and efficiency, with ablations validating the contributions of CSA, CDC, and gdMLP. The framework also shows potential for other image restoration problems, offering a practical, scalable Transformer-based solution for high-resolution vision tasks.
Abstract
The Transformer architecture has achieved remarkable success in natural language processing and high-level vision tasks over the past few years. However, the complexity of self-attention is quadratic in the image size, leading to prohibitive computational costs for high-resolution vision tasks. In this paper, we introduce Concertormer, featuring a novel Concerto Self-Attention (CSA) mechanism designed for image deblurring. The proposed CSA divides self-attention into two distinct components: one emphasizes global correspondence in general, while the other concentrates on specifically local correspondence. By retaining partial information in additional dimensions independent of the self-attention calculations, our method effectively captures global contextual representations with complexity linear in the image size. To leverage the additional dimensions, we present a Cross-Dimensional Communication module, which linearly combines attention maps and thus enhances expressiveness. Moreover, we merge the two-stage Transformer design into a single stage using the proposed gated-dconv MLP architecture. While our primary objective is single-image motion deblurring, extensive quantitative and qualitative evaluations demonstrate that our approach performs favorably against state-of-the-art methods in other tasks, such as deraining and deblurring with JPEG artifacts. The source code and trained models will be made available to the public.
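
To make the abstract's contrast between quadratic and linear complexity concrete, the following is a minimal counting sketch in Python. It is not the CSA mechanism: it only compares the number of query-key pairs in full global self-attention with a generic attention restricted to fixed-size, non-overlapping windows, used here as a stand-in for any scheme whose cost is linear in the image size. The function names and the window size of 8 are assumptions chosen for illustration.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every pixel attends to every pixel.

    With n = height * width tokens the count is n * n, i.e. quadratic in n.
    """
    n = height * width
    return n * n


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention stays inside non-overlapping tiles.

    Hypothetical stand-in for a linear-cost scheme, not the paper's CSA.
    Each window x window tile attends only within itself, so the total is
    n * window**2, i.e. linear in n for a fixed window size.
    """
    if height % window or width % window:
        raise ValueError("height and width must be multiples of window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


if __name__ == "__main__":
    for side in (64, 128, 256):
        g = global_attention_pairs(side, side)
        w = windowed_attention_pairs(side, side, window=8)
        print(f"{side}x{side}: global={g:,} windowed={w:,} ratio={g // w}")
```

Doubling the side length quadruples the pixel count, so the global count grows 16-fold while the windowed count grows only 4-fold. This widening gap is why quadratic attention becomes impractical at high resolution.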
