GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tong Lu, Tae-Kyun Kim, Wei Liu, Hongdong Li

TL;DR

GridFormer introduces a grid-structured transformer backbone for image restoration in adverse weather, integrating residual dense transformer blocks and a compact-enhanced transformer layer to efficiently fuse multi-scale features. Its three-path architecture (grid head, grid fusion, grid tail) enables cross-resolution information exchange, while the residual dense transformer block promotes feature reuse through dense connections and local residual learning. The method combines a Charbonnier loss with a perceptual term to produce sharp, semantically coherent restorations, delivering state-of-the-art results across deraining, dehazing, desnowing, and multi-weather tasks. This approach advances practical restoration performance under diverse weather conditions and offers a versatile backbone for downstream vision tasks in real-world settings.
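As a rough illustration of the loss described above, the sketch below computes a Charbonnier term on flat lists of pixel values and adds a weighted perceptual term. The per-element averaging, the value `eps=1e-3`, the weight `0.1`, and the helper names `charbonnier_loss` and `total_loss` are assumptions made for this example and are not taken from the summary. The perceptual term itself normally requires a pretrained feature network, so it is passed in here as an already computed number.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((p - t)^2 + eps^2) over all elements (assumed per-element form)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def total_loss(pred, target, perceptual_term, weight=0.1):
    """Charbonnier loss plus a weighted perceptual term computed elsewhere.

    `weight` is a hypothetical balancing factor, not a value from the paper.
    """
    return charbonnier_loss(pred, target) + weight * perceptual_term


# Toy "restored" and "clean" pixel values.
restored = [0.2, 0.5, 0.9]
clean = [0.2, 0.4, 1.0]
loss = charbonnier_loss(restored, clean)
```

Because of the `eps` term, the loss behaves like a smooth L1 penalty: it stays differentiable at zero error and bottoms out at `eps` rather than at exactly zero.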

Abstract

Image restoration in adverse weather conditions is a difficult task in computer vision. In this paper, we propose a novel transformer-based framework called GridFormer which serves as a backbone for image restoration under adverse weather conditions. GridFormer is designed in a grid structure using a residual dense transformer block, and it introduces two core designs. First, it uses an enhanced attention mechanism in the transformer layer. The mechanism includes stages of the sampler and compact self-attention to improve efficiency, and a local enhancement stage to strengthen local information. Second, we introduce a residual dense transformer block (RDTB) as the final GridFormer layer. This design further improves the network's ability to learn effective features from both preceding and current local features. The GridFormer framework achieves state-of-the-art results on five diverse image restoration tasks in adverse weather conditions, including image deraining, dehazing, deraining & dehazing, desnowing, and multi-weather restoration. The source code and pre-trained models are available at https://github.com/TaoWangzj/GridFormer.
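To give a feel for why a sampling stage before self-attention can improve efficiency, the following back-of-the-envelope sketch counts query-key pairs before and after sampling. It assumes that the sampler reduces the token grid by a rate `rate` along each spatial axis and that attention is then computed among the sampled tokens only. That reading, and the helper name `attention_pairs`, are assumptions for illustration and may differ from the actual layer.

```python
def attention_pairs(height, width, rate=1):
    """Number of query-key pairs when tokens are sampled by `rate` per spatial axis."""
    if height % rate or width % rate:
        raise ValueError("height and width must be divisible by rate")
    tokens = (height // rate) * (width // rate)
    return tokens * tokens


# A 64 x 64 feature map has 4096 tokens; sampling with rate 2 leaves 1024.
full = attention_pairs(64, 64)
compact = attention_pairs(64, 64, rate=2)
reduction = full // compact
```

Under these assumptions the size of the attention map shrinks by a factor of `rate ** 4`, which is the kind of saving that makes a separate local enhancement stage worthwhile for recovering fine detail.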

Paper Structure (16 sections, 8 equations, 15 figures, 11 tables)

This paper contains 16 sections, 8 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: Comparison results for image restoration in adverse weather conditions. Results on (top) weather-specific restoration, and (bottom) multi-weather restoration tasks, showing state-of-the-art performance in terms of PSNR.
  • Figure 2: GridFormer architecture. It consists of a grid head, a grid fusion module, and a grid tail. The pyramid degraded images $\mathbf{X}_{0}, \mathbf{X}_{1}, \mathbf{X}_{2}$ are first fed into the grid head to extract hierarchical initial features $\mathbf{F}_{0}, \mathbf{F}_{1}, \mathbf{F}_{2}$. The initial features are further refined by the grid fusion module to generate features $\hat{\mathbf{F}}_{0}, \hat{\mathbf{F}}_{1}, \hat{\mathbf{F}}_{2}$. Finally, the grid tail reconstructs clear images $\hat{\mathbf{X}}_{0}, \hat{\mathbf{X}}_{1}, \hat{\mathbf{X}}_{2}$.
  • Figure 3: Grid unit structure and information flow. (a) A single grid unit comprises four parts: the down-sampling layer, the GridFormer layer, the up-sampling layer, and attention fusion operations. RDTL refers to the proposed residual dense transformer layer. (b) Information flow of grid units in the fusion module.
  • Figure 4: The structure of the proposed Residual Dense Transformer Block (RDTB). It includes three residual dense transformer layers, a $1\times1$ convolution for local feature fusion, and a local skip connection for local residual learning. The residual dense transformer layer is mainly built from the proposed compact-enhanced transformer layer, which contains the compact-enhanced self-attention and an FFN.
  • Figure 5: Right: the schematic illustration of the proposed Compact-enhanced Transformer Layer consisting of a compact-enhanced attention and a Feed-Forward Network (FFN). Left: the compact-enhanced attention layer, which contains three steps: feature sampling, compact self-attention, and local enhancement. $H$, $W$, and $C$ denote the height, width, and number of feature channels, respectively. $r$ is the feature sampling rate. $\copyright$ and $\oplus$ refer to concatenation and element-wise summation operations, respectively.
  • ...and 10 more figures
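The RDTB wiring described in the Figure 4 caption (three densely connected layers, a 1×1 convolution for local feature fusion, and a local skip connection) can be sketched on a single pixel's channel vector as below. This is only a structural illustration: each transformer layer is replaced by a hypothetical averaging map, the growth rate of 2 channels per layer is an assumption, and the names `linear`, `uniform_weights`, and `rdtb` are invented for this example.

```python
def linear(inputs, weights):
    """Per-pixel channel mixing: one output per weight row (acts like a 1x1 convolution)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def uniform_weights(n_out, n_in):
    """Placeholder weights that average the inputs; real weights would be learned."""
    return [[1.0 / n_in] * n_in for _ in range(n_out)]


def rdtb(x, num_layers=3, growth=2):
    """Dense connections + 1x1 fusion + local skip, on one pixel's channel vector."""
    features = list(x)
    widths = [len(features)]
    for _ in range(num_layers):
        # Each layer sees the block input plus all earlier layer outputs.
        # The averaging map is a stand-in for a transformer layer.
        new = linear(features, uniform_weights(growth, len(features)))
        features = features + new
        widths.append(len(features))
    # The 1x1 "convolution" fuses the concatenation back to the input width.
    fused = linear(features, uniform_weights(len(x), len(features)))
    # Local residual learning: add the block input back.
    out = [a + b for a, b in zip(x, fused)]
    return out, widths


out, widths = rdtb([1.0, 1.0, 1.0, 1.0])
```

Here `widths` records how the concatenated feature grows from 4 to 6, 8, and 10 channels before the fusion step brings it back to 4, which is what lets the output be added to the block input through the skip connection.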