Joint multi-dimensional dynamic attention and transformer for general image restoration

Huan Zhang, Xu Zhang, Nian Cai, Jianglei Di, Yun Zhang

TL;DR

The paper addresses general image restoration under rain, haze, and noise by proposing MDDA-former, a U-Net–style architecture whose CNN-based encoder and decoder are built from Multi-dimensional Dynamic Attention Blocks (MDAB) and whose latent layer uses an Efficient Transformer Block (ETB) for global modeling. MDAB relies on MDConv, a convolution whose kernels are modulated by spatial, channel, and filter attentions, to capture diverse local degradations, while ETB applies transposed self-attention with linear complexity to extract global cues efficiently. Across 18 benchmarks covering deraining, deblurring, denoising, dehazing, and low-light enhancement, MDDA-former demonstrates competitive or superior performance with reduced FLOPs and competitive latency, and it also shows improvements for high-level vision tasks. The work highlights a CNN-Transformer hybrid design in a U-shaped architecture that balances accuracy and efficiency, with potential for real-world deployment in weather-affected imaging and downstream tasks.

Abstract

Outdoor images often suffer from severe degradation due to rain, haze, and noise, which impairs image quality and makes high-level tasks harder. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we use only CNNs in the encoder-decoder and only transformers in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks (deraining, deblurring, denoising, dehazing, and enhancement), as well as superior performance on high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.
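
To make the efficiency claim about transposed self-attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common reading of "transposed" attention, in which the attention map is computed across channels (size C x C) rather than across pixels (size HW x HW), so the cost grows linearly with the number of pixels. The function name, the projection matrices `w_q`, `w_k`, `w_v`, and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transposed_self_attention(x, w_q, w_k, w_v, temperature=1.0):
    """Channel-wise ("transposed") self-attention on one feature map.

    x          : (C, H, W) feature map
    w_q/k/v    : (C, C) hypothetical projection matrices (stand-ins for 1x1 convs)
    temperature: hypothetical scaling factor applied before the softmax
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)                 # (C, HW): one row per channel
    q, k, v = w_q @ tokens, w_k @ tokens, w_v @ tokens

    # Normalise each channel row so dot products become cosine similarities.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)

    # The attention map is C x C, independent of the spatial size HW.
    attn = softmax((q @ k.T) * temperature, axis=-1)
    out = attn @ v                               # (C, HW)
    return out.reshape(c, h, w), attn


rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
x = rng.standard_normal((C, H, W))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
y, attn = transposed_self_attention(x, w_q, w_k, w_v)
print(y.shape, attn.shape)
```

Because the softmax acts on a C x C matrix, doubling the image resolution only lengthens the rows of `q`, `k`, and `v`; the attention map itself keeps the same size.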

Paper Structure

This paper contains 20 sections, 8 equations, 14 figures, 18 tables.

Figures (14)

  • Figure 1: Performance/latency comparison between our method and representative state-of-the-art methods on six datasets across five image restoration tasks.
  • Figure 2: U-shaped image restoration architectures. C, T, and H denote CNN, Transformer, and Hybrid (CNN + Transformer) blocks, respectively. (a)-(e) Common U-shaped image restoration architectures. (f) Our MDDA-former adopts CNNs in the encoder-decoder and embeds a Transformer into the latent layer to balance complexity and performance. (g) Performance/FLOPs comparison between our method and the other architectures within the same framework, as detailed in Tab. \ref{table:ablation_architecture}.
  • Figure 3: The top part is the proposed framework of MDDA-former. The encoder-decoder consists of CNN-based Multi-dimensional Dynamic Attention Blocks (MDAB), and the Effective Transformer Block (ETB) is embedded into the latent layer. The middle part is a detailed description of MDAB and ETB.
  • Figure 4: Comparison of conventional convolution and MDConv. (a) Schematic of conventional convolution, where $\mathbf{W}$ denotes the static convolutional kernels of a group of filters. (b) Schematic of MDConv, where $\mathbf{W}_d$ denotes the weight $\mathbf{W}$ adjusted by multiplying it with the learned $\bm{\alpha}_{s}$, $\bm{\alpha}_{c}$, and $\bm{\alpha}_{f}$ along the spatial-wise, channel-wise, and filter-wise dimensions of $\mathbf{W}$.
  • Figure 5: Visual comparisons on synthetic rainy images sampled from the Rain100L dataset. Zoom in for the best view.
  • ...and 9 more figures
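
The kernel modulation described in the Figure 4 caption can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's MDConv: the caption only says that the static weight is multiplied by learned spatial, channel, and filter attentions. How those attentions are produced is assumed here (global average pooling followed by one linear head and a sigmoid per dimension), and the names `md_conv`, `p_s`, `p_c`, and `p_f` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def md_conv(x, weight, p_s, p_c, p_f):
    """Input-dependent convolution with a kernel modulated along three dimensions.

    x      : (C_in, H, W) input feature map
    weight : (C_out, C_in, k, k) static kernel W (k odd)
    p_s    : (k*k, C_in)   hypothetical head producing the spatial attention
    p_c    : (C_in, C_in)  hypothetical head producing the channel attention
    p_f    : (C_out, C_in) hypothetical head producing the filter attention
    """
    c_out, c_in, k, _ = weight.shape

    # Assumption: attentions are predicted from a globally pooled descriptor.
    context = x.mean(axis=(1, 2))                             # (C_in,)
    alpha_s = sigmoid(p_s @ context).reshape(1, 1, k, k)      # spatial-wise
    alpha_c = sigmoid(p_c @ context).reshape(1, c_in, 1, 1)   # channel-wise
    alpha_f = sigmoid(p_f @ context).reshape(c_out, 1, 1, 1)  # filter-wise

    # W_d: the static kernel rescaled along its three dimensions.
    w_d = weight * alpha_s * alpha_c * alpha_f

    # Plain "same"-padded cross-correlation with the modulated kernel.
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + h, j:j + w]                   # (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w_d[:, :, i, j], patch)
    return out, w_d


rng = np.random.default_rng(0)
c_in, c_out, k = 4, 6, 3
x = rng.standard_normal((c_in, 10, 12))
weight = rng.standard_normal((c_out, c_in, k, k))
p_s = rng.standard_normal((k * k, c_in))
p_c = rng.standard_normal((c_in, c_in))
p_f = rng.standard_normal((c_out, c_in))
y, w_d = md_conv(x, weight, p_s, p_c, p_f)
print(y.shape, w_d.shape)
```

Since the three attentions depend on the input, two differently degraded images are filtered with different effective kernels even though the stored weight is shared.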