Joint multi-dimensional dynamic attention and transformer for general image restoration

Huan Zhang; Xu Zhang; Nian Cai; Jianglei Di; Yun Zhang

TL;DR

The paper addresses robust general image restoration under rain, haze, and noise by proposing MDDA-former, a U-Net–style architecture that uses CNN-based encoders/decoders with Multi-dimensional Dynamic Attention Blocks in the FPN and a latent-layer Efficient Transformer Block for global modeling. MDAB leverages MDConv with spatial, channel, and filter attentions to capture diverse local degradations, while ETB applies transposed self-attention with linear complexity to extract global cues efficiently. Across 18 benchmarks covering deraining, deblurring, denoising, dehazing, and low-light enhancement, MDDA-former demonstrates competitive or superior performance with reduced FLOPs and competitive latency, and it also shows improvements for high-level vision tasks. The work highlights a principled CNN-Transformer hybrid design in a U-shaped architecture that balances accuracy and efficiency, with potential for real-world deployment in weather-affected imaging and downstream tasks.
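The summary above does not spell out how MDConv is formulated, so the following is only a minimal NumPy sketch of the general idea: a convolution whose kernel is rescaled per input by attentions along its spatial, input-channel, and output-filter dimensions. The function name `dynamic_conv2d`, the projection matrices `proj_s`, `proj_c`, `proj_f`, the global-average-pooled descriptor, and the sigmoid gating are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dynamic_conv2d(x, weight, proj_s, proj_c, proj_f):
    """Convolution with an input-dependent kernel (illustrative sketch only).

    x:      (C_in, H, W) feature map
    weight: (C_out, C_in, K, K) static kernel, K odd
    proj_s: (K*K, C_in)   hypothetical projection for spatial attention
    proj_c: (C_in, C_in)  hypothetical projection for channel attention
    proj_f: (C_out, C_in) hypothetical projection for filter attention
    Returns the output map (C_out, H, W) and the modulated kernel.
    """
    c_out, c_in, k, _ = weight.shape
    # Assumption: one global descriptor per input channel drives all attentions.
    desc = x.mean(axis=(1, 2))                            # (C_in,)
    a_s = sigmoid(proj_s @ desc).reshape(1, 1, k, k)      # over kernel positions
    a_c = sigmoid(proj_c @ desc).reshape(1, c_in, 1, 1)   # over input channels
    a_f = sigmoid(proj_f @ desc).reshape(c_out, 1, 1, 1)  # over output filters
    w_dyn = weight * a_s * a_c * a_f                      # kernel adapted to x

    # Zero-padded, stride-1 cross-correlation so the output keeps H x W.
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + h, j:j + w]               # (C_in, H, W)
            out += np.einsum("oc,chw->ohw", w_dyn[:, :, i, j], patch)
    return out, w_dyn
```

Because the three attention tensors broadcast against the kernel, the per-input cost of adapting the weights is a few small matrix-vector products on top of an ordinary convolution.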

Abstract

Outdoor images often suffer from severe degradation due to rain, haze, and noise, which impairs image quality and hampers high-level vision tasks. Current image restoration methods struggle to handle complex degradation while remaining efficient. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To exploit both the global modeling capability of transformers and the local modeling capability of convolutions, we use only CNNs in the encoder-decoder and only transformers in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks (deraining, deblurring, denoising, dehazing, and low-light enhancement), as well as superior performance on high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.
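To make the efficiency argument for transposed self-attention concrete, here is a minimal NumPy sketch under stated assumptions. Attention is computed between channels rather than between pixels, so the attention map is C x C and the cost grows linearly with the number of pixels H*W. The function name, the plain matrix projections standing in for learned layers, the L2 normalisation, and the `temperature` scale are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def transposed_self_attention(x, w_q, w_k, w_v, temperature=1.0):
    """Channel-wise ("transposed") self-attention, illustrative sketch only.

    x:             (C, H, W) feature map
    w_q, w_k, w_v: (C, C) projection matrices (stand-ins for learned layers)
    Returns the output map (C, H, W) and the (C, C) attention matrix.
    """
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)               # (C, N) with N = H * W
    q = w_q @ tokens
    k = w_k @ tokens
    v = w_v @ tokens

    # Assumption: normalise each channel over spatial positions so the
    # channel-to-channel similarities stay in a bounded range.
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)

    # (C, N) @ (N, C) -> (C, C): cost O(C^2 * N), linear in the pixel count.
    logits = (q @ k.T) * temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)

    out = attn @ v                             # (C, N)
    return out.reshape(c, h, w), attn
```

By contrast, standard spatial self-attention would build an N x N map, which is quadratic in the number of pixels and quickly becomes impractical for high-resolution restoration inputs.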
