Spatiotemporal Blind-Spot Network with Calibrated Flow Alignment for Self-Supervised Video Denoising

Zikang Chen, Tao Jiang, Xiaowan Hu, Wang Zhang, Huaqiu Li, Haoqian Wang

TL;DR

This work tackles self-supervised video denoising, which trains without ground-truth clean videos. It introduces the Spatiotemporal Blind-Spot Network (STBN), which combines bidirectional blind-spot temporal propagation (via the Blind-Spot Alignment, or BSA, block) with a Spatial Receptive Field Expansion (SRFE) module to capture long-range spatiotemporal context. It also contributes a calibrated frame-alignment strategy and an unsupervised optical-flow refinement through knowledge distillation, which makes inter-frame alignment less sensitive to noise. The method is reported to achieve state-of-the-art or competitive results on both synthetic and real-world noisy video datasets without labeled data, and the code is publicly available.

Abstract

Self-supervised video denoising aims to remove noise from videos without relying on ground truth data, leveraging the video itself to recover clean frames. Existing methods often rely on simplistic feature stacking or apply optical flow without thorough analysis. This results in suboptimal utilization of both inter-frame and intra-frame information, and it also neglects the potential of optical flow alignment under self-supervised conditions, leading to biased and insufficient denoising outcomes. To this end, we first explore the practicality of optical flow in the self-supervised setting and introduce a SpatioTemporal Blind-spot Network (STBN) for global frame feature utilization. In the temporal domain, we utilize bidirectional blind-spot feature propagation through the proposed blind-spot alignment block to ensure accurate temporal alignment and effectively capture long-range dependencies. In the spatial domain, we introduce the spatial receptive field expansion module, which enhances the receptive field and improves global perception capabilities. Additionally, to reduce the sensitivity of optical flow estimation to noise, we propose an unsupervised optical flow distillation mechanism that refines fine-grained inter-frame interactions during optical flow alignment. Our method demonstrates superior performance across both synthetic and real-world video denoising datasets. The source code is publicly available at https://github.com/ZKCCZ/STBN.
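
The figure list for this paper describes the spatial receptive field expansion module as a patch-shuffle step, residual blocks, and a patch-unshuffle step. The summary here does not define patch-shuffle, so the following is only a minimal sketch under the assumption that it behaves like a standard space-to-depth rearrangement. The function names `patch_shuffle` and `patch_unshuffle` and the factor `r` are hypothetical and are not taken from the released code.

```python
import numpy as np

def patch_shuffle(x, r):
    """Space-to-depth (ASSUMED meaning of patch-shuffle): (H, W) -> (r*r, H//r, W//r).

    Sub-image a*r + b collects the pixels x[a::r, b::r].
    """
    h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    x = x.reshape(h // r, r, w // r, r)           # index order [i, a, j, b]
    return x.transpose(1, 3, 0, 2).reshape(r * r, h // r, w // r)

def patch_unshuffle(y, r):
    """Inverse rearrangement: (r*r, H//r, W//r) -> (H, W)."""
    _, hs, ws = y.shape
    y = y.reshape(r, r, hs, ws)                   # index order [a, b, i, j]
    return y.transpose(2, 0, 3, 1).reshape(hs * r, ws * r)

r = 2
x = np.arange(36).reshape(6, 6)
y = patch_shuffle(x, r)
restored = patch_unshuffle(y, r)

# Neighbouring entries inside one sub-image are r pixels apart in the
# original frame, so a 3x3 kernel applied to a sub-image spans 2*r + 1
# original pixels per axis.
spacing = int(y[0][0, 1] - y[0][0, 0])            # x[0, 2] - x[0, 0]
```

Under this assumption the rearrangement is lossless, and any layers placed between the two steps see pixels that are `r` apart in the original frame, which is one way a fixed kernel size can cover a wider area.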

Paper Structure

This paper contains 33 sections, 17 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Illustrative comparison of frame sequence utilization strategies in self-supervised video denoising methods.
  • Figure 2: Illustration of the proposed method: (a) Overall architecture of STBN, including spatiotemporal feature aggregation and optical flow refinement. (b) The Bidirectional Blind-Spot Propagation utilizes the BSA block for global temporal awareness in both forward and backward propagation. (c) Detailed process of the Spatial Receptive Field Expansion module, which sequentially incorporates patch-shuffle, residual blocks, and patch-unshuffle to effectively enhance the spatial receptive field.
  • Figure 3: Visualization of (a) BSA block for temporal processing and (b) SRFE for spatial receptive field expansion.
  • Figure 4: Visualization of noise distribution and correlation for two interpolation methods. Bilinear interpolation introduces spatial correlation and distorts the noise distribution, while nearest-neighbor interpolation preserves it.
  • Figure 5: Visual comparisons of different methods on synthetic noise data.
  • ...and 6 more figures
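
The Figure 4 caption states that bilinear interpolation introduces spatial correlation and distorts the noise distribution, while nearest-neighbor interpolation preserves it. The sketch below illustrates that statement on synthetic data only. It assumes i.i.d. Gaussian noise and a uniform horizontal shift of half a pixel in place of a real optical-flow field; the helper names and the values of `sigma` and `dx` are hypothetical and do not come from the paper or its code.

```python
import numpy as np

def shift_bilinear(img, dx):
    """Resample at x + dx (0 <= dx < 1) by blending the two horizontal neighbours."""
    return (1.0 - dx) * img[:, :-1] + dx * img[:, 1:]

def shift_nearest(img, dx):
    """Resample at x + dx by copying the closest pixel (no blending)."""
    k = int(np.floor(dx + 0.5))                   # 0 or 1 for dx in [0, 1)
    return img[:, k:k + img.shape[1] - 1]

def lag1_corr(img):
    """Pearson correlation between horizontally adjacent pixels."""
    return float(np.corrcoef(img[:, :-1].ravel(), img[:, 1:].ravel())[0, 1])

rng = np.random.default_rng(0)
sigma = 25.0                                      # assumed noise level
noise = rng.normal(0.0, sigma, size=(512, 512))   # i.i.d. Gaussian noise
dx = 0.5                                          # assumed sub-pixel shift

warped_bilinear = shift_bilinear(noise, dx)
warped_nearest = shift_nearest(noise, dx)

# Variance relative to the input noise variance sigma**2.
var_ratio_bilinear = float(warped_bilinear.var() / sigma**2)
var_ratio_nearest = float(warped_nearest.var() / sigma**2)

# Correlation between adjacent pixels after warping.
corr_bilinear = lag1_corr(warped_bilinear)
corr_nearest = lag1_corr(warped_nearest)

print(f"bilinear: var ratio {var_ratio_bilinear:.3f}, lag-1 corr {corr_bilinear:.3f}")
print(f"nearest:  var ratio {var_ratio_nearest:.3f}, lag-1 corr {corr_nearest:.3f}")
```

With a half-pixel shift, each bilinear output averages two independent samples, so its variance is halved, and adjacent outputs share one input sample, which gives a lag-1 correlation of about 0.5. Nearest-neighbor sampling only copies pixels, so the noise stays pixel-wise independent with its original variance. Pixel-wise independent noise is the property that blind-spot denoisers usually assume.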