Cross Paradigm Representation and Alignment Transformer for Image Deraining

Shun Zou, Yi Zou, Juncheng Li, Guangwei Gao, Guojun Qi

TL;DR

This work introduces CPRAformer, a cross-paradigm Transformer for image deraining that fuses global-local and spatial-channel representations via CPIA-SA, which combines SPC-SA and SPR-SA. Key innovations include the Efficient Prompt Guide Operator for dynamic sparsity, the Adaptive Alignment Frequency Module for two-stage feature fusion in the frequency domain, and the Multi-Scale Flow Gating Network for scale-aware representation. Together, these components enable robust cross-paradigm interaction and hierarchical feature alignment, achieving state-of-the-art results across eight datasets and demonstrating strong generalization to dehazing and downstream vision tasks. The approach offers a principled path to leveraging complementary representations in low-level vision, with practical impact on real-world rain removal and related restoration challenges.

Abstract

Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we extract the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validate CPRAformer's robustness in other image restoration tasks and downstream applications.
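
The distinction between spatial and channel-wise self-attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the CPRAformer implementation: SPC-SA and SPR-SA add prompt guidance, sparsity, and refinement steps that are not shown, and the sketch assumes a single head, no learned projections, and plain scaled dot-product scores. The function names are hypothetical. The point is only that spatial attention treats pixels as tokens (an HW x HW map), while channel attention treats channels as tokens (a C x C map).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat):
    """Pixels are tokens. feat: (C, H, W); attention map: (HW, HW)."""
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T                    # (HW, C)
    attn = softmax(tokens @ tokens.T / np.sqrt(c))       # (HW, HW)
    out = attn @ tokens                                  # (HW, C)
    return out.T.reshape(c, h, w), attn

def channel_attention(feat):
    """Channels are tokens. feat: (C, H, W); attention map: (C, C)."""
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w)                      # (C, HW)
    attn = softmax(tokens @ tokens.T / np.sqrt(h * w))   # (C, C)
    out = attn @ tokens                                  # (C, HW)
    return out.reshape(c, h, w), attn

# Toy feature map: 8 channels, 16 x 16 pixels (illustrative sizes only).
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
sp_out, sp_attn = spatial_attention(feat)
ch_out, ch_attn = channel_attention(feat)
print("spatial attention map:", sp_attn.shape)
print("channel attention map:", ch_attn.shape)
```

The spatial map grows quadratically with image resolution, while the channel map depends only on the channel count. That difference in cost is one reason the two views are treated as complementary rather than interchangeable.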

Paper Structure

This paper contains 16 sections, 15 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Feature patterns obtained from four perspectives are distinct, and each of the two deraining paradigms offers unique advantages. Recent deraining research mainly focuses on either the spatial-channel or the global-local paradigm, lacking a framework that effectively integrates the two.
  • Figure 2: The overall architecture of our proposed CPRAformer.
  • Figure 3: Comparison of different self-attention mechanisms. (a) The naive self-attention mechanism [Zamir2021Restormer] computes and retains all tokens. (b) The Top-K sparse attention mechanism [Chen_2023_CVPR] sets a fixed K value (here, K is set to 1/4) and retains only the top-K fraction of tokens with the highest attention values while setting the remaining tokens to zero. (c) Our dynamic Top-K sparse attention mechanism adaptively modulates the K value based on input features. For instance, compared to the fixed K in (b), the K value increases in the upper image and decreases in the lower image to adapt to different images.
  • Figure 4: The qualitative comparison on Test100 [zhang2019image]. See the supplements for more visualizations.
  • Figure 5: The qualitative comparison on raindrop datasets [Qian_2018_CVPR]. Our result has the best visual quality and details.
  • ...and 5 more figures
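
The fixed versus dynamic Top-K behaviour described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the helper name topk_sparse_softmax and its keep_ratio parameter are hypothetical, masking is applied to the logits before the softmax so that dropped entries receive exactly zero weight, and the way CPRAformer predicts K from the input features is not reproduced. The "dynamic" case is imitated by simply passing different ratios.

```python
import numpy as np

def topk_sparse_softmax(scores, keep_ratio):
    """Row-wise softmax that keeps only the largest logits in each row.

    scores: (N, N) attention logits.
    keep_ratio: fraction of entries kept per row, in (0, 1].
    Entries below the per-row threshold get zero attention weight.
    """
    n = scores.shape[-1]
    k = max(1, int(round(n * keep_ratio)))
    # The k-th largest logit in each row acts as that row's threshold.
    # (Exact ties at the threshold would keep more than k entries.)
    thresh = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) == 0, so masked entries vanish
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((8, 8))  # toy logits, illustrative only

# Fixed K, as in panel (b): keep 1/4 of the entries in every row.
fixed = topk_sparse_softmax(scores, 0.25)

# Stand-in for panel (c): a different ratio per input. Here the ratios
# are hard-coded; in the paper they are adapted to the input features.
kept_per_ratio = {}
for ratio in (0.125, 0.25, 0.5):
    attn = topk_sparse_softmax(scores, ratio)
    kept_per_ratio[ratio] = (attn > 0).sum(axis=-1)
    print(ratio, kept_per_ratio[ratio])
```

A larger ratio keeps more entries per row and moves the result toward dense attention, while a smaller ratio discards more low-scoring entries.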