Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising

Mingjie Ji, Zhan Shi, Kailai Zhou, Zixuan Fu, Xun Cao

TL;DR

F2R is a spatiotemporal decoupling framework that divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. This decoupling allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.

Abstract

Self-supervised video denoising methods typically extend image-based frameworks into the temporal dimension, yet they often struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing Video Blind-Spot Networks (BSNs) enforce noise independence by masking the center pixel. This constraint prevents the use of spatial evidence for texture recovery, thereby severing spatiotemporal correlations and causing texture loss. To address this, we propose Frames2Residual (F2R), a spatiotemporal decoupling framework that explicitly divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. In Stage 1, a blind temporal estimator learns inter-frame consistency using a frame-wise blind strategy, producing a temporally consistent anchor. In Stage 2, a non-blind spatial refiner leverages this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. Extensive experiments demonstrate that our decoupling strategy allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.

Paper Structure (12 sections, 4 equations, 6 figures, 7 tables)


Figures (6)

  • Figure 1: Comparison of self-supervised video denoising paradigms during the inference phase. (a) Video Noise2Noise. Warping-based supervision violates noise independence and causes warping artifacts, leading to blurred details. (b) Video Blind-Spot Network. The inherent imposition of pixel discontinuities severs vital spatiotemporal correlations, resulting in texture loss. (c) The proposed F2R framework. F2R employs joint spatial inputs $\{z_i\}$, where $z_i=\mathrm{cat}(\hat{x}_i,r_i)$, $r_i=y_i-\hat{x}_i$, and $\hat{x}_i$ is derived from a pre-trained image denoiser $\mathcal{D}$. (d) Visualization of the F2R framework. The gray dashed line illustrates the temporal consistency established by Stage 1. The final output acquires spatial specificity while maintaining temporal consistency, thereby effectively restoring spatiotemporal correlations.
  • Figure 2: Frame-wise Blind Strategy. (a) Preprocessing for input construction. (b) Blind estimator training. The flow pyramid $\mathcal{V}$ is constructed by downsampling the base flow $\{\mathcal{V}_{i \to t}\}_{i \neq t}$ and halving its magnitude at each level.
  • Figure 3: (a) Training and (b) Inference phases of the Spatial Refiner. Notably, the preprocessing operations are strictly identical across both phases. The recorruption noise $n'$ is sampled from the known noise model.
  • Figure 4: Visual comparison on DAVIS (top) and Set8 (bottom) datasets under noise level $\sigma = 30$. We compare our F2R with supervised (FastDVDNet, FloRNN, NAFNet) and unsupervised (UDVD, TAP) methods, with PSNR/SSIM metrics shown below each patch. Yellow arrows indicate regions for detailed texture comparison. $\dag$ indicates the supervised method.
  • Figure 5: Visual comparison on CRVD indoor dataset. The results have been converted to the sRGB domain with the pretrained ISP provided in yue2020supervised for visualization. Yellow arrows indicate regions for detailed texture comparison. $\dag$ indicates the supervised method.
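
Two constructions named in the captions above can be illustrated with a minimal NumPy sketch: the joint spatial input $z_i=\mathrm{cat}(\hat{x}_i,r_i)$ with $r_i=y_i-\hat{x}_i$ (Figure 1), and a flow pyramid obtained by downsampling a base flow and halving its magnitude at each level (Figure 2). This is not the authors' implementation. The function names, the channel-first layout, concatenation along the channel axis, the 2x average pooling used for downsampling, and the placeholder standing in for the pre-trained denoiser $\mathcal{D}$ are all assumptions made for illustration.

```python
import numpy as np


def joint_spatial_input(y, x_hat):
    """Build z = cat(x_hat, r) with r = y - x_hat.

    y, x_hat: arrays of shape (C, H, W). Concatenating along the
    channel axis is an assumption; the caption only says "cat".
    """
    r = y - x_hat  # residual between the noisy frame and the denoised estimate
    return np.concatenate([x_hat, r], axis=0)


def flow_pyramid(base_flow, levels=3):
    """Build a coarse-to-fine list of flows from a base flow.

    base_flow: array of shape (2, H, W) holding per-pixel displacements.
    Each coarser level is downsampled by 2 (average pooling, an assumed
    choice) and its magnitude is halved, so displacements stay expressed
    in pixels of that level's resolution.
    """
    pyramid = [base_flow]
    for _ in range(levels - 1):
        f = pyramid[-1]
        _, h, w = f.shape
        h2, w2 = h // 2, w // 2
        f = f[:, : h2 * 2, : w2 * 2]  # crop odd rows/columns
        pooled = f.reshape(2, h2, 2, w2, 2).mean(axis=(2, 4))
        pyramid.append(0.5 * pooled)
    return pyramid


# Hypothetical usage with synthetic data.
rng = np.random.default_rng(0)
y = rng.random((3, 8, 8))            # noisy frame y_i
x_hat = 0.5 * y + 0.5 * y.mean()     # placeholder for the output of D
z = joint_spatial_input(y, x_hat)    # shape (6, 8, 8)

flow = np.full((2, 8, 8), 4.0)       # constant 4-pixel displacement
pyr = flow_pyramid(flow, levels=3)   # spatial sizes 8x8, 4x4, 2x2
```

Because $z_i$ stores both $\hat{x}_i$ and $r_i$, the noisy frame is recoverable as their sum, so no information from $y_i$ is lost by the split. In the pyramid, a constant 4-pixel flow becomes 2 pixels and then 1 pixel at the two coarser levels.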