RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining

Hongtao Wu, Yijun Yang, Huihui Xu, Weiming Wang, Jinni Zhou, Lei Zhu

TL;DR

RainMamba is a video deraining network built on state space models (SSMs). It combines a Hilbert-curve local scanning strategy with coarse-to-fine Mamba blocks to capture both global and local spatio-temporal dependencies. A difference-guided dynamic contrastive locality learning strategy strengthens patch-level self-similarity learning, which helps the network remove rain streaks and raindrops. The authors report experiments on four synthesized video deraining datasets and on real-world rainy videos, and attribute the network's computational efficiency to the linear complexity of SSMs. The work positions SSM-based vision models as a competitive option for low-level video restoration tasks.

Abstract

Outdoor vision systems are frequently contaminated by rain streaks and raindrops, which significantly degrade the performance of visual tasks and multimedia applications. Videos provide redundant temporal cues that make rain removal more stable. Traditional video deraining methods rely heavily on optical flow estimation and kernel-based approaches, which have a limited receptive field. Transformer architectures enable long-term dependencies but significantly increase computational complexity. Recently, the linear-complexity operator of state space models (SSMs) has, by contrast, enabled efficient long-term temporal modeling, which is crucial for removing rain streaks and raindrops in videos. However, its one-dimensional sequential processing of videos destroys local correlations across the spatio-temporal dimensions by distancing adjacent pixels. To address this, we present an improved SSM-based video deraining network (RainMamba) with a novel Hilbert scanning mechanism to better capture sequence-level local information. We also introduce a difference-guided dynamic contrastive locality learning strategy to enhance the patch-level self-similarity learning ability of the proposed network. Extensive experiments on four synthesized video deraining datasets and real-world rainy videos demonstrate the effectiveness and efficiency of our network in removing rain streaks and raindrops. Our code and results are available at https://github.com/TonyHongtaoWu/RainMamba.
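
The locality argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it works on a single 2D frame rather than a spatio-temporal volume, and the function names (hilbert_d2xy, hilbert_order, raster_order, max_step) are hypothetical. It uses the standard index-to-coordinate construction of the Hilbert curve and compares how far apart consecutive sequence elements are in the image plane under a row-by-row scan and under a Hilbert scan.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. This is the standard iterative construction,
    not code taken from the RainMamba repository.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so sub-curves join end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Pixel coordinates of an n x n frame in Hilbert-scan order."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def raster_order(n):
    """Pixel coordinates of an n x n frame in row-by-row (raster) order."""
    return [(x, y) for y in range(n) for x in range(n)]


def max_step(path):
    """Largest Manhattan distance between consecutive positions in a scan."""
    return max(
        abs(x1 - x0) + abs(y1 - y0)
        for (x0, y0), (x1, y1) in zip(path, path[1:])
    )


n = 8
print(max_step(raster_order(n)))   # 8: the scan jumps across the frame at each row end
print(max_step(hilbert_order(n)))  # 1: neighbours in the sequence are neighbours in the frame
```

Under the raster scan, the last pixel of one row and the first pixel of the next are adjacent in the sequence but far apart in the frame, whereas every step of the Hilbert scan moves to a directly adjacent pixel. The sketch only shows this property of the curve; how the scan is extended to the temporal dimension is described in the paper and is not reproduced here.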

Paper Structure (41 sections, 14 equations, 13 figures, 9 tables)

This paper contains 41 sections, 14 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Motivation and visual comparison of two scanning methods. (a) and (c) illustrate the global scan method, while (b) and (d) illustrate the Hilbert scan method. The temporal scanning differences are emphasized in (a) and (b), and the spatial scanning differences are depicted in (c) and (d). The lines and endpoints (which represent pixels) are shaded in gradients from dark to light to indicate the scan path. For a more intuitive understanding, please refer to the dynamic display in the Supplementary Video. We leverage the locality of the Hilbert curve to make better use of local information across the spatio-temporal dimensions during scanning. The visual results indicate that our local scanning mechanism improves spatial structure preservation in the derained results.
  • Figure 2: The architecture of our proposed RainMamba framework for the video deraining task. Given a sequence of rainy video frames, the cascaded Coarse-to-Fine Mamba Module (CFM) receives the encoded features as input and causally models temporal correlations with the improved state space models (SSMs). The CFM employs a Global Mamba Block (GMB) and a Local Mamba Block (LMB) to capture sequence-level global and local spatio-temporal dependencies. We develop a novel Hilbert scanning paradigm in the LMB to promote Mamba's locality learning. Moreover, we construct a difference-guided dynamic contrastive locality learning approach to enhance patch-level locality learning. Specifically, we use the difference between the input and the ground truth to select the anchor, then sample the positive patch at a location spatio-temporally adjacent to the anchor and the negative patch at a more distant location. As training progresses, the sampling space for positive samples expands while that for negative samples contracts.
  • Figure 3: The motivation and operation of our proposed Difference-Guided Dynamic Contrastive Locality Learning.
  • Figure 4: Visual comparisons of derained results from our network and state-of-the-art deraining methods on input video frames from the VRDS dataset. (Please zoom in for a better illustration.)
  • Figure 5: Visual comparisons of derained results produced by our network and state-of-the-art deraining methods on input video frames from real-world rainy videos. (Please zoom in for a better illustration.)
  • ...and 8 more figures
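
The sampling procedure described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the anchor is taken as the patch with the largest input-to-ground-truth difference, distances are measured as Chebyshev distance on a (time, row, column) patch grid, and the radius schedule is linear. All function names and default radii (select_anchor, radii, sample_pair, pos_start, neg_start, and so on) are hypothetical.

```python
import random


def select_anchor(diff):
    """Return the (t, i, j) index of the patch with the largest difference.

    diff[t][i][j] is assumed to hold a per-patch difference score between
    the rainy input and the ground truth (for example a mean absolute error).
    """
    best_value, best_index = None, None
    for t, frame in enumerate(diff):
        for i, row in enumerate(frame):
            for j, value in enumerate(row):
                if best_value is None or value > best_value:
                    best_value, best_index = value, (t, i, j)
    return best_index


def radii(progress, pos_start=1, pos_end=2, neg_start=4, neg_end=3):
    """Hypothetical linear schedule over training progress in [0, 1].

    The positive radius grows (the positive sampling space expands) and the
    negative inner radius shrinks, so negatives are drawn closer to the
    anchor later in training.
    """
    pos = round(pos_start + progress * (pos_end - pos_start))
    neg = round(neg_start + progress * (neg_end - neg_start))
    return pos, neg


def chebyshev(a, b):
    """Chebyshev distance between two (t, i, j) patch indices."""
    return max(abs(p - q) for p, q in zip(a, b))


def sample_pair(diff, progress, rng):
    """Pick an anchor, a nearby positive patch, and a distant negative patch."""
    anchor = select_anchor(diff)
    pos_r, neg_r = radii(progress)
    cells = [
        (t, i, j)
        for t in range(len(diff))
        for i in range(len(diff[0]))
        for j in range(len(diff[0][0]))
    ]
    pos_pool = [c for c in cells if 1 <= chebyshev(c, anchor) <= pos_r]
    neg_pool = [c for c in cells if chebyshev(c, anchor) >= neg_r]
    return anchor, rng.choice(pos_pool), rng.choice(neg_pool)


# Toy example: 3 frames of 8 x 8 patches with one strongly degraded patch.
diff = [[[0.0] * 8 for _ in range(8)] for _ in range(3)]
diff[1][2][5] = 0.9
rng = random.Random(0)
anchor, positive, negative = sample_pair(diff, progress=0.0, rng=rng)
print(anchor, positive, negative)
```

The sketch assumes the patch grid is large enough for both pools to be non-empty. How the sampled patches enter the contrastive loss is defined in the paper and is not modelled here.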