Joint multi-dimensional dynamic attention and transformer for general image restoration
Huan Zhang, Xu Zhang, Nian Cai, Jianglei Di, Yun Zhang
TL;DR
The paper addresses general image restoration under rain, haze, and noise by proposing MDDA-former, a U-Net-style architecture with a CNN-based encoder and decoder built from Multi-dimensional Dynamic Attention Blocks (MDAB) and a latent layer built from Efficient Transformer Blocks (ETB) for global modeling. MDAB uses a multi-dimensional dynamic convolution (MDConv) with spatial, channel, and filter attention to capture diverse local degradations, while ETB applies transposed self-attention, whose cost is linear in the number of pixels, to extract global cues efficiently. Across 18 benchmarks covering deraining, deblurring, denoising, dehazing, and low-light enhancement, MDDA-former achieves competitive or superior restoration quality with fewer FLOPs and competitive latency, and it also improves results on downstream high-level vision tasks. The work presents a CNN-Transformer hybrid in a U-shaped architecture that balances accuracy and efficiency, with potential for deployment in weather-affected imaging and downstream tasks.
Abstract
Outdoor images often suffer from severe degradation due to rain, haze, and noise, which impairs image quality and hinders high-level vision tasks. Current image restoration methods struggle to handle complex degradation while remaining efficient. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capability of transformers and the local modeling capability of convolutions, we use only CNN blocks in the encoder-decoder and only transformer blocks in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to handle diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks (deraining, deblurring, denoising, dehazing, and enhancement), as well as superior performance on high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.
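To make the efficiency argument for transposed self-attention concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes queries, keys, and values are already given as C x N matrices (C channels, N = H*W flattened pixels), and it assumes L2-normalized rows and a scalar temperature, choices that are common for channel-wise attention but are not stated in the abstract. All function names are hypothetical. The point it illustrates is that the attention map is C x C, so the cost grows linearly with the number of pixels N rather than quadratically.

```python
import math

def matmul(a, b):
    # Plain matrix product of two nested lists: (r x m) @ (m x c) -> (r x c).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def l2_normalize_rows(m, eps=1e-12):
    # Assumption: each channel (row) is scaled to unit length before attention.
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row)) + eps
        out.append([x / norm for x in row])
    return out

def softmax_rows(m):
    # Numerically stable softmax applied independently to each row.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def transposed_self_attention(q, k, v, temperature=1.0):
    """Channel-wise (transposed) attention on C x N inputs.

    Returns (attn, out) where attn is C x C and out is C x N.
    The temperature is a hypothetical scalar; in a trained model it
    would typically be a learned parameter.
    """
    qn = l2_normalize_rows(q)
    kn = l2_normalize_rows(k)
    # Similarity between channels, not between pixels: C x C.
    scores = matmul(qn, transpose(kn))
    scores = [[s * temperature for s in row] for row in scores]
    attn = softmax_rows(scores)
    # Mix channels of V: (C x C) @ (C x N) -> C x N, cost O(C^2 * N).
    return attn, matmul(attn, v)
```

For an input with C = 2 channels and N = 4 pixels, `transposed_self_attention(q, k, v)` returns a 2 x 2 attention map and a 2 x 4 output. Doubling N doubles the work in both matrix products but leaves the attention map size unchanged, which is the sense in which the cost is linear in spatial resolution.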
