MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration

Zhi Jin, Yuwei Qiu, Kaihao Zhang, Hongdong Li, Wenhan Luo

TL;DR

Image restoration with Transformers is hampered by the quadratic cost of attention and by fixed token scales. MB-TaylorFormer V2 introduces Taylor-expanded self-attention (T-MSA++), which approximates the remainder of the first-order expansion with a norm-preserving mapping and achieves linear complexity $\mathcal{O}(hw)$ in the image size $h \times w$. It combines this with a multi-branch, multi-scale patch embedding that provides diverse receptive fields. Across five restoration tasks (dehazing, deraining, desnowing, motion deblurring, and denoising), it reports state-of-the-art results while using fewer parameters and MACs than existing methods, thanks to parallel branches and convolutional positional encoding. The work advances efficient, high-performance Transformers for low-level vision and provides open-source code for replication.

Abstract

Recently, Transformer networks have demonstrated outstanding performance in image restoration thanks to their global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention significantly limits its application to image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant uses the Taylor expansion to approximate Softmax-attention and uses a norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named the second version of the Taylor formula expansion-based Transformer (MB-TaylorFormer V2 for short), can concurrently process coarse-to-fine features, capture long-distance pixel interactions at limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead. The source code is available at https://github.com/FVL2020/MB-TaylorFormerV2.
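
To illustrate why a first-order Taylor expansion can make attention linear in the number of tokens, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes unit-normalized queries and keys (so each weight $1 + q \cdot k$ is non-negative), covers only the first-order term, and omits the norm-preserving remainder that T-MSA++ adds. All function names are hypothetical. The key step is that $\sum_j (1 + q_i \cdot k_j)\, v_j = \sum_j v_j + q_i \cdot \sum_j k_j v_j^{\top}$, so the sums over $j$ can be computed once and reused for every query.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm so that every q.k lies in [-1, 1]."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def taylor_attention_quadratic(Q, K, V):
    """Reference version: builds every weight 1 + q_i.k_j, so O(n^2) in tokens."""
    out = []
    for q in Q:
        weights = [1.0 + dot(q, k) for k in K]  # first-order stand-in for exp(q.k)
        z = sum(weights)                        # row normalizer
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / z
                    for c in range(len(V[0]))])
    return out


def taylor_attention_linear(Q, K, V):
    """Same result, but the sums over keys are computed once: O(n) in tokens."""
    n, d, dv = len(K), len(K[0]), len(V[0])
    v_sum = [sum(v[c] for v in V) for c in range(dv)]            # sum_j v_j
    k_sum = [sum(k[r] for k in K) for r in range(d)]             # sum_j k_j
    kv = [[sum(k[r] * v[c] for k, v in zip(K, V)) for c in range(dv)]
          for r in range(d)]                                     # sum_j k_j v_j^T
    out = []
    for q in Q:
        z = n + dot(q, k_sum)                                    # sum_j (1 + q.k_j)
        out.append([(v_sum[c] + sum(q[r] * kv[r][c] for r in range(d))) / z
                    for c in range(dv)])
    return out


# Toy data (assumption: 8 tokens, 4 channels, fixed seed for repeatability).
rng = random.Random(0)
n_tokens, dim = 8, 4
Q = [normalize([rng.uniform(-1, 1) for _ in range(dim)]) for _ in range(n_tokens)]
K = [normalize([rng.uniform(-1, 1) for _ in range(dim)]) for _ in range(n_tokens)]
V = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]

slow = taylor_attention_quadratic(Q, K, V)
fast = taylor_attention_linear(Q, K, V)
max_diff = max(abs(a - b) for rs, rf in zip(slow, fast) for a, b in zip(rs, rf))
print(f"max difference between the two orderings: {max_diff:.2e}")
```

Both functions compute the same quantity; only the order of the multiplications changes. The linear version never forms the $n \times n$ weight matrix, which is what makes the cost scale with the number of pixels rather than its square.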

Paper Structure

This paper contains 20 sections, 19 equations, 15 figures, 12 tables, 1 algorithm.

Figures (15)

  • Figure 1: Improvement of MB-TaylorFormer V2 over the SOTA approaches. The circle size is proportional to the number of model parameters.
  • Figure 2: Architecture of MB-TaylorFormer V2. (a) MB-TaylorFormer V2 uses a multi-branch hierarchical design based on multi-scale patch embedding. (b) Multi-scale patch embedding embeds coarse-to-fine patches. (c) T-MSA++ with linear computational complexity.
  • Figure 3: Illustration of DSDCN. The offsets are generated by $K \times K$ depthwise convolutions and pointwise convolutions, and the output is generated by $K \times K$ depthwise deformable convolutions and pointwise convolutions. D represents the number of channels of the feature maps.
  • Figure 4: Illustration of the receptive field of DSDCN (the offsets are truncated to [-3,3]). The upper bound of the receptive field of the DSDCN is $9\times 9$ and the lower bound is $1\times 1$.
  • Figure 5: $e^{x}$ (orange) and its first-order Taylor expansion (blue). The closer $x$ is to 0, the more closely the blue line approximates the orange curve.
  • ...and 10 more figures
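
The captions of Figures 4 and 5 make two small quantitative claims that can be checked with a few lines of Python. The sketch below is illustrative only and its function names are hypothetical. For Figure 5 it takes the first-order expansion of $e^{x}$ around 0 to be $1 + x$. For Figure 4 it assumes a stride-1, dilation-1 kernel with $K = 3$ and integer pixel offsets; the caption does not state $K$, but $K = 3$ is the value consistent with a $9 \times 9$ upper bound when offsets are truncated to [-3,3].

```python
import math


def first_order_gap(x):
    """Gap between e^x and its first-order Taylor expansion 1 + x."""
    return math.exp(x) - (1.0 + x)


def receptive_field_bounds(kernel_size, max_offset):
    """Per-axis (narrowest, widest) extent of a deformable kernel.

    Assumes an odd kernel_size, stride 1, dilation 1, and that every tap can
    move by at most max_offset whole pixels along each axis.
    """
    half = (kernel_size - 1) // 2
    # Widest case: the outermost taps move outward by max_offset on both sides.
    widest = 2 * (half + max_offset) + 1
    # Narrowest case: all taps move inward; they collapse onto one pixel
    # only if the outermost tap can reach the centre.
    narrowest = 1 if max_offset >= half else 2 * (half - max_offset) + 1
    return narrowest, widest


# The gap is zero at x = 0 and grows as x moves away from 0.
for x in (0.0, 0.1, 0.5, 1.0):
    print(f"x = {x:.1f}   e^x - (1 + x) = {first_order_gap(x):.4f}")

# K = 3 (assumed) with offsets truncated to [-3, 3].
print(receptive_field_bounds(3, 3))  # (1, 9)
```

Under these assumptions the extent per axis ranges from 1 to 9 pixels, matching the $1 \times 1$ and $9 \times 9$ bounds in the Figure 4 caption. The growing gap in the first loop is the quantity the Figure 5 caption describes.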