Zero-Shot Video Deraining with Video Diffusion Models

Tuomas Varanka, Juan Luis Gonzalez, Hyeongwoo Kim, Pablo Garrido, Xu Yao

TL;DR

The paper tackles zero-shot video deraining by leveraging a pretrained video diffusion model to remove rain without synthetic data or fine-tuning. It inverts the input video into the diffusion latent space and edits the reconstruction via negative prompting, augmented by an attention-switching mechanism in the diffusion transformer to preserve motion and structure. Key innovations include a diffusion-prior editing objective with a skip timestep, cross-condition hidden features for stable KV swapping, and selective attention switching over high-frequency blocks. Empirical results on real-world rain datasets show superior rain removal and temporal consistency compared with state-of-the-art baselines, along with ablations that validate the design choices. The approach offers a practical, training-free pathway for video restoration and may extend to other degradation-removal tasks as diffusion models continue to mature.

Abstract

Existing video deraining methods are often trained on paired datasets that are either synthetic, which limits their ability to generalize to real-world rain, or captured by static cameras, which restricts their effectiveness in dynamic scenes with background and camera motion. Furthermore, recent work on fine-tuning diffusion models has shown promising results, but fine-tuning tends to weaken the generative prior, limiting generalization to unseen cases. In this paper, we introduce the first zero-shot video deraining method for complex dynamic scenes that requires neither synthetic data nor model fine-tuning, by leveraging a pretrained text-to-video diffusion model that demonstrates strong generalization capabilities. By inverting an input video into the latent space of the diffusion model, we can intervene in its reconstruction process and push it away from the model's concept of rain using negative prompting. At the core of our approach is an attention switching mechanism that we found to be crucial for maintaining dynamic backgrounds as well as structural consistency between the input and the derained video, mitigating artifacts introduced by naive negative prompting. Our approach is validated through extensive experiments on real-world rain datasets, demonstrating substantial improvements over prior methods and showcasing robust generalization without the need for supervised training.
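The negative-prompting step described in the abstract can be illustrated with a small sketch. It assumes the standard classifier-free guidance combination with the rain prompt placed in the "unconditional" slot and the null prompt in the "conditional" slot; the paper's own editing equation is not reproduced in this section, so the formula, the function name `negative_prompt_guidance`, and the `scale` parameter are illustrative assumptions rather than the authors' exact objective.

```python
from typing import List, Sequence


def negative_prompt_guidance(
    pred_null: Sequence[float],
    pred_rain: Sequence[float],
    scale: float,
) -> List[float]:
    """Combine two denoiser predictions so the result moves away from rain.

    pred_null: model output under the null (empty) prompt, i.e. the
               reconstruction path.
    pred_rain: model output under the negative ("rain") prompt.
    scale:     guidance weight (hypothetical name). scale == 1 returns the
               plain reconstruction; larger values push further from rain.

    Assumed form (standard classifier-free guidance with a negative prompt):
        pred = pred_rain + scale * (pred_null - pred_rain)
    """
    if len(pred_null) != len(pred_rain):
        raise ValueError("predictions must have the same length")
    return [r + scale * (n - r) for n, r in zip(pred_null, pred_rain)]


# Toy usage on a flattened two-element "latent".
reconstruction = negative_prompt_guidance([1.0, 2.0], [3.0, 2.0], scale=1.0)
derained = negative_prompt_guidance([1.0, 2.0], [3.0, 2.0], scale=2.0)
```

With `scale=1.0` the rain term cancels and the output equals the null-prompt prediction; raising the scale extrapolates along the direction from the rain prediction towards the null prediction, which is the sense in which the reconstruction is "pushed away" from rain.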

Paper Structure

This paper contains 21 sections, 7 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Visual results for video deraining on a real rain video. Compared with state-of-the-art methods such as RainMamba [wu2024-rainmamba] and TURTLE [ghasemabadi2024-turtle], our zero-shot solution is more effective in removing rain streaks and generates temporally consistent results. Please refer to the supplementary for the corresponding video.
  • Figure 2: Example cases of rain. Notably, synthetic rain often exhibits unrealistic repetitive patterns not present in real-world scenes while lacking interactions with the scene objects and depth. Rainy scenes generated by large diffusion models show more realistic effects. Refer to the supplementary for the respective videos.
  • Figure 3: Architecture of the proposed zero-shot video deraining approach. First, video inversion is performed to extract the noise latent $z_{img}^T$. Next, starting from timestep $t_s$, the model performs a reconstruction step with the null prompt and a rain condition step with the negative prompt. The two paths are then combined following the editing equation. Attention switching is applied for blocks $\mathcal{B}$, where $K^{\varnothing}_{text}$ and $V^{\varnothing}_{text}$ are extracted from the null condition and are used to replace their conditional equivalents $K^{c}_{text}$ and $V^{c}_{text}$.
  • Figure 4: Comparison between DDIM [song2022-ddim] and DDPM [huberman2024-ddpm_inversion] inversion on video data. Video DDIM inversion struggles with fully reconstructing the video and misses not only high-frequency details but also larger objects. The PSNR drop in video DDPM inversion is mostly caused by the VideoVAE (PSNR = 31.80) and numerical precision.
  • Figure 5: Impact of skipping an individual block on the PSNR. Losing high-frequency information, which is captured by blocks 0-5 and blocks 15-30, has a lower impact on PSNR.
  • ...and 11 more figures
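
The attention switching described in the Figure 3 caption can be sketched as follows. This is a toy, dependency-free illustration that assumes a single attention head in which text and video tokens share one key/value sequence; the function names, the argument layout, and the example block set are hypothetical and are not taken from the paper's implementation.

```python
import math
from typing import List, Set

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector],
              values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention on lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def switched_attention(block_idx: int, switch_blocks: Set[int],
                       queries: List[Vector],
                       k_video: List[Vector], v_video: List[Vector],
                       k_text_cond: List[Vector], v_text_cond: List[Vector],
                       k_text_null: List[Vector], v_text_null: List[Vector],
                       ) -> List[Vector]:
    """Attention for the rain-conditioned path with optional K/V switching.

    In blocks listed in switch_blocks, the text keys and values of the
    conditional path are replaced by those computed under the null prompt.
    All other blocks use the conditional text keys and values unchanged.
    """
    use_null = block_idx in switch_blocks
    k_text = k_text_null if use_null else k_text_cond
    v_text = v_text_null if use_null else v_text_cond
    # Text tokens come first, then video tokens (an assumed ordering).
    return attention(queries, k_text + k_video, v_text + v_video)


# Toy usage: one video query, one video token, one text token.
q = [[1.0, 0.0]]
k_vid, v_vid = [[1.0, 0.0]], [[0.0, 1.0]]
k_c, v_c = [[0.0, 1.0]], [[5.0, 5.0]]   # conditional ("rain") text K/V
k_n, v_n = [[0.0, 1.0]], [[1.0, 1.0]]   # null-prompt text K/V
blocks_to_switch = {3}                   # hypothetical choice of block set
switched = switched_attention(3, blocks_to_switch, q, k_vid, v_vid,
                              k_c, v_c, k_n, v_n)
unswitched = switched_attention(4, blocks_to_switch, q, k_vid, v_vid,
                                k_c, v_c, k_n, v_n)
```

In the switched block the output depends only on null-prompt text features, so the rain prompt cannot alter what the video tokens read from the text stream there; which blocks belong in the set is a design choice that the paper studies separately.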