Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining

Shangquan Sun, Wenqi Ren, Juxiang Zhou, Shu Wang, Jianhou Gan, Xiaochun Cao

TL;DR

This paper tackles the mismatch between synthetic and real-world rain in video deraining and the need for downstream task support. It introduces VDMamba, a dual-branch spatio-temporal state-space model with a spatial state-space model layer (S3ML) and a temporal state-space model layer (TSML), and a dynamic stacking filter (DSF) for adaptive frame fusion, plus a median stacking loss for semi-supervised learning. It also presents RVDT, a real-world benchmark for rain-affected object detection and tracking to evaluate practical impact. Experiments show state-of-the-art deraining performance on synthetic and real-world videos, real-time efficiency, and improved downstream task metrics after deraining.

Abstract

Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.

Paper Structure

This paper contains 13 sections, 1 theorem, 24 equations, 8 figures, 3 tables.

Key Result

Theorem 1

Given a set of values $G=\left\{x_n\right\}_{n=1}^N$, its median is a solution to minimizing the mean absolute deviation, as expressed by the equation:

$$ {\mathrm{median}}(G) \in \arg\min_{y} \frac{1}{N}\sum_{n=1}^{N} \left| x_n - y \right|, $$

where ${\mathrm{median}}(\cdot)$ denotes the median of a set.
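
The statement of Theorem 1 can be checked numerically. The sketch below is only an illustration, not code from the paper: the sample values in `G`, the helper `mean_abs_deviation`, and the grid search are all assumptions chosen for the example. It compares the median of a small set against a brute-force minimizer of the mean absolute deviation.

```python
from statistics import median


def mean_abs_deviation(values, y):
    """Mean absolute deviation of `values` around a candidate point y."""
    return sum(abs(x - y) for x in values) / len(values)


# Hypothetical intensities of one pixel across N = 5 aligned frames.
# The two large values stand in for sparse, bright rain streaks.
G = [0.20, 0.22, 0.21, 0.90, 0.85]

m = median(G)

# Brute-force search over a fine grid of candidate points in [0, 1].
candidates = [i / 1000 for i in range(1001)]
best = min(candidates, key=lambda y: mean_abs_deviation(G, y))

print("median:", m)
print("grid minimizer:", best)
```

With an odd number of values the minimizer is unique, so the grid search should land on the median itself. The example also shows why the median suits a sparsity prior: the two outlying "rain" values do not pull the minimizer away from the background intensity.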

Figures (8)

  • Figure 1: The functionality of various stacking filters in video restoration. With aligned candidate frames, (a) shows that the min and median filters effectively remove falling rain streaks, while (b) and (c) illustrate how the mean and max filters aid in noise reduction and filling missing pixels, respectively. Our dynamic stacking filter can adaptively approximate these filters at the pixel level, facilitating versatile frame fusion.
  • Figure 2: The comparison of performance, time and space complexities among video deraining methods. The red line denotes the boundary of real-time inference.
  • Figure 3: Given two adjacent rainy frames, rain degradation leads to errors in optical flow estimation and subsequent frame alignment. In contrast, our degradation-free multi-frame estimation pipeline facilitates more accurate warping.
  • Figure 4: The architecture of our proposed VDMamba for video deraining, consisting of spatial state-space model layers (S3ML) for single-frame feature extraction and temporal state-space model layers (TSML) for multi-frame feature fusion. Due to the modular design of the spatial branch, the sub-model, VDMamba-S, which contains only S3ML, is also capable of performing single-image deraining.
  • Figure 5: Qualitative comparison of existing deraining methods on the synthetic datasets RainSynLight25 and RainSynComplex25 (Liu et al., 2018) and NTURain (Chen et al., 2018). Please zoom in for a better view.
  • ...and 3 more figures
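
The stacking filters described in the caption of Figure 1 can be sketched on toy data. This is a minimal illustration under stated assumptions, not the paper's implementation: the frames are assumed to be already aligned, the 2x3 frames and intensity values are invented, and `stack_filter` is a hypothetical helper. The function `soft_stack` is likewise an assumption: it is one simple softmax-weighted average that moves between min, mean, and max as a single parameter changes, shown only to make "adaptively approximating" such filters concrete. It is not the paper's DSF formulation, which this section does not give, and it does not reproduce the median.

```python
import math
from statistics import mean, median


def stack_filter(frames, reduce_fn):
    """Apply reduce_fn across the frame axis at each pixel of aligned frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [reduce_fn([frame[r][c] for frame in frames]) for c in range(width)]
        for r in range(height)
    ]


def soft_stack(values, beta):
    """Softmax-weighted average (illustrative assumption, not the paper's DSF).

    beta -> -inf approaches min, beta = 0 gives the mean, beta -> +inf approaches max.
    """
    weights = [math.exp(beta * v) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Toy data (hypothetical): a constant 2x3 background of 0.2, with one bright
# rain-streak pixel (0.9) at a different location in each of three frames.
background = 0.2
frames = [[[background] * 3 for _ in range(2)] for _ in range(3)]
frames[0][0][0] = 0.9
frames[1][0][1] = 0.9
frames[2][1][2] = 0.9

min_stack = stack_filter(frames, min)        # removes bright, sparse streaks
median_stack = stack_filter(frames, median)  # also removes them
mean_stack = stack_filter(frames, mean)      # averages streaks into the result
max_stack = stack_filter(frames, max)        # keeps every streak
```

Because each streak occupies a given pixel in only one of the three frames, the min and median stacks recover the clean background, while the mean stack leaves a faint residue and the max stack keeps all streaks.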

Theorems & Definitions (1)

  • Theorem 1