Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization

Hongtao Wu, Yijun Yang, Angelica I Aviles-Rivero, Jingjing Ren, Sixiang Chen, Haoyu Chen, Lei Zhu

TL;DR

Snow degrades outdoor videos and challenges vision systems, and methods trained on synthetic snow suffer from the distribution shift between synthetic and real snow. The authors introduce SemiVDN, a semi-supervised video desnowing network that leverages unlabeled real snow videos via a Mean-Teacher framework and a Distribution-driven Contrastive Regularization to bridge the synthetic-real gap. A Prior-guided Temporal Decoupling Experts module explicitly decomposes snow degradation into snow, transmission, and atmospheric-light components within a physics-informed Transformer, while a GMM-based ultra-positive sampling strategy improves alignment with real snow distributions. Evaluations on RVSD and RealSnow85 show state-of-the-art performance and improved real-world generalization, with gains in PSNR, SSIM, and perceptual metrics, supporting its applicability to real-world snowy scenes.
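The Mean-Teacher scheme mentioned in the summary can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the parameter dictionary, the `decay` value, and the L1 form of the consistency term are assumptions made for the example. The only parts taken from the general Mean-Teacher idea are that the teacher's weights track an exponential moving average of the student's weights, and that the student is encouraged to agree with the teacher on unlabeled inputs.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the matching student parameter."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

def consistency_loss(student_pred, teacher_pred):
    """Mean absolute difference between predictions (the L1 form is an assumption)."""
    return float(np.mean(np.abs(student_pred - teacher_pred)))

# Toy parameters standing in for network weights (hypothetical names).
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}

# One EMA step: the teacher moves 10% of the way toward the student.
teacher = ema_update(teacher, student, decay=0.9)
```

In a training loop under these assumptions, the student would be updated by gradient descent on a supervised loss (labeled synthetic videos) plus the consistency term (unlabeled real videos), and `ema_update` would run after every student step.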

Abstract

Snow degradation poses a formidable challenge to computer vision tasks by corrupting outdoor scenes. While current deep learning-based desnowing approaches succeed on synthetic benchmark datasets, they struggle to restore out-of-distribution real-world snowy videos because paired real-world training data are scarce. To address this bottleneck, we devise a semi-supervised paradigm for video desnowing that incorporates unlabeled real data for generalizable snow removal. Specifically, we construct a real-world dataset of 85 snowy videos and present a Semi-supervised Video Desnowing Network (SemiVDN) equipped with a novel Distribution-driven Contrastive Regularization. This contrastive regularization mitigates the distribution gap between synthetic and real data and consequently preserves the desired snow-invariant background details. Furthermore, based on the atmospheric scattering model, we introduce a Prior-guided Temporal Decoupling Experts module that decomposes the physical components of a snowy video in a frame-correlated manner. We evaluate SemiVDN on benchmark datasets and the collected real snowy data. The experimental results demonstrate the superiority of our approach over state-of-the-art image- and video-level desnowing methods.
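To make the physical decomposition concrete, the sketch below composes a snowy frame from a clean frame and three components (snow, transmission, atmospheric light) and then inverts the composition. The veiling step is the standard atmospheric scattering model, I = K·T + A·(1 − T). How the snow layer enters is an assumption made for illustration (opaque white snow blended by a mask); the abstract does not state the exact formation model, and all function and variable names here are hypothetical.

```python
import numpy as np

def compose_snowy(clean, snow, trans, airlight):
    """Synthesize a snowy frame from a clean frame and three physical components."""
    # Assumed snow model: white snow blended over the scene by a soft mask.
    veiled = clean * (1.0 - snow) + snow
    # Atmospheric scattering model: attenuate by transmission, add airlight.
    return veiled * trans + airlight * (1.0 - trans)

def recover_clean(snowy, snow, trans, airlight, eps=1e-6):
    """Invert compose_snowy when the three components are known."""
    veiled = (snowy - airlight * (1.0 - trans)) / np.maximum(trans, eps)
    return (veiled - snow) / np.maximum(1.0 - snow, eps)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(2, 4, 4, 3))  # (frames, height, width, channels)
snow = rng.uniform(0.0, 0.8, size=(2, 4, 4, 1))   # snow mask, kept below 1 so inversion is defined
trans = rng.uniform(0.3, 1.0, size=(2, 4, 4, 1))  # transmission map
airlight = 0.9                                     # global atmospheric light

snowy = compose_snowy(clean, snow, trans, airlight)
restored = recover_clean(snowy, snow, trans, airlight)
```

The round trip only works because the true components are supplied; in a desnowing network they would have to be estimated from the snowy input, which is the harder part of the problem.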

Paper Structure

This paper contains 17 sections, 12 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: The distribution shift between synthetic snow and real snow.
  • Figure 2: Left sub-figure: The proposed semi-supervised method, trained on synthetic and real videos, yields favorable results on snowy video samples captured in various scenarios, including forests, country roads and movies. Right sub-figure: Trade-off between PSNR and runtime/GFLOPs on the RVSD dataset [chen2023snow].
  • Figure 3: Schematic illustration of our Semi-Supervised Video Desnowing Network (SemiVDN). SemiVDN is based on the mean teacher scheme with a student model and a teacher model. We first develop a Prior-guided Temporal Decoupling Experts module (see Fig. 5) to decompose the physical components that make up a snowy video along the temporal dimension. After that, we compute supervised losses for labeled data and unsupervised losses for unlabeled data. Based on the decomposed component features ($F'_{B}$ and $F'_{S}$) in representation space, we develop a Distribution-driven Contrastive Regularization that highlights the snow-invariant information by replacing the snow-specific feature in ultra-positive samples and replacing the background in negative samples.
  • Figure 4: Comparison of the snow layer decomposition results. The results indicate that our method decouples more accurate and cleaner snow layers without background interference.
  • Figure 5: Illustration of the proposed Prior-guided Temporal Decoupling Experts framework. Given an input snowy sequence, the Physics Transformer Block (PTB) takes the encoded features as input and employs the Temporal Decoupling Experts module to generate physics-specific components (i.e., S, A, and T) for recovery. Specifically, we use the Temporal Decomposition Router to compute the temporal weights $\mathbf{Q}_{i j}$ along the temporal dimension, which are then used to form linear combinations of all input temporal tokens. Each expert (an MLP in this work) then processes its temporally adaptive tokens to obtain the corresponding output component tokens $\tilde{\mathbf{E}}_{j}$. Finally, we use the decomposed weights from the Temporal Decomposition Router to convexly combine all the component tokens. The combined output features ${\hat{X}}_{k}$ and physics-specific features ${\hat{P}}^{j}_{k}$ are subsequently fed into the Prior-guided Recovery Module and the decoder to generate the final desnowed results.
  • ...and 3 more figures
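
The routing described in the Figure 5 caption (router weights over temporal tokens, one MLP expert per component, then a convex combination of expert outputs) can be sketched as below. This is a rough NumPy illustration under stated assumptions, not the paper's implementation: the dispatch and combine weights are modeled as softmaxes of the same token-expert logits, in the style of soft mixture-of-experts layers, and the shapes, the single-matrix `router`, and the two-layer ReLU MLPs are all hypothetical choices.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU, standing in for one expert."""
    return np.maximum(x @ w1, 0.0) @ w2

def temporal_decoupling_experts(tokens, router, experts):
    logits = tokens @ router            # (n_tokens, n_experts)
    dispatch = softmax(logits, axis=0)  # per expert: convex weights over temporal tokens
    combine = softmax(logits, axis=1)   # per token: convex weights over experts
    expert_in = dispatch.T @ tokens     # (n_experts, dim), one mixed token per expert
    expert_out = np.stack([mlp(expert_in[j], w1, w2)
                           for j, (w1, w2) in enumerate(experts)])
    combined = combine @ expert_out     # (n_tokens, dim)
    return combined, expert_out, dispatch, combine

rng = np.random.default_rng(0)
n_tokens, dim, hidden, n_experts = 5, 8, 16, 3  # three experts, e.g. for S, A and T
tokens = rng.normal(size=(n_tokens, dim))
router = rng.normal(size=(dim, n_experts))
experts = [(rng.normal(size=(dim, hidden)), rng.normal(size=(hidden, dim)))
           for _ in range(n_experts)]

combined, expert_out, dispatch, combine = temporal_decoupling_experts(tokens, router, experts)
```

Here `expert_out` plays the role of the per-component tokens and `combined` the role of the recombined features; because both weight matrices are softmaxes, every expert input and every output token is a convex combination, which matches the "convexly combine" wording in the caption.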