SAVeD: Learning to Denoise Low-SNR Video for Improved Downstream Performance

Suzanne Stathatos, Michael Hobley, Pietro Perona, Markus Marks

TL;DR

SAVeD tackles denoising in low-SNR video domains (e.g., underwater sonar, ultrasound, microscopy) where clean ground-truth data are unavailable. It introduces a self-supervised framework that amplifies foreground motion through a three-frame reconstruction target and a lightweight encoder–bottleneck–decoder, producing denoised frames without clean targets. A key contribution is the Foreground-to-Background Divergence (FBD) metric, enabling unsupervised evaluation aligned with downstream task performance. The method achieves state-of-the-art gains across classification, detection, tracking, and counting on diverse datasets while requiring fewer training resources than prior denoising approaches, suggesting broad applicability in scientific and medical imaging contexts.

Abstract

Low signal-to-noise ratio videos -- such as those from underwater sonar, ultrasound, and microscopy -- pose significant challenges for computer vision models, particularly when paired clean imagery is unavailable. We present Spatiotemporal Augmentations and denoising in Video for Downstream Tasks (SAVeD), a novel self-supervised method that denoises low-SNR sensor videos using only raw noisy data. By leveraging distinctions between foreground and background motion and exaggerating objects with stronger motion signal, SAVeD enhances foreground object visibility and reduces background and camera noise without requiring clean video. SAVeD has a set of architectural optimizations that lead to faster throughput, training, and inference than existing deep learning methods. We also introduce a new denoising metric, FBD, which indicates foreground-background divergence for detection datasets without requiring clean imagery. Our approach achieves state-of-the-art results for classification, detection, tracking, and counting tasks, and it does so with fewer training resource requirements than existing deep-learning-based denoising methods.

Project page: https://suzanne-stathatos.github.io/SAVeD
Code page: https://github.com/suzanne-stathatos/SAVeD

Paper Structure

This paper contains 28 sections, 6 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: SpatioTemporal Denoising improves classification, detection, tracking, and counting in video. We denoise sonar and ultrasound videos of fish in a river, lung scans, breast lesion scans, and cell microscopy to improve downstream classification, detection, tracking, and counting tasks. We propose a self-supervised method to enhance the foreground signal of video frames without manual annotations or clean imagery. Our method works on grayscale videos with non-stationary backgrounds, low signal-to-noise ratios, and a variable number of objects per video.
  • Figure 2: SAVeD, our approach for self-supervised denoising using spatiotemporal difference and identity reconstruction. $I_t$, $I_{t-T}$, and $I_{t-2T}$ are video frames at times $t$ (current frame), $t-T$, and $t-2T$. These frames are input to an appearance encoder $\Phi$. The resulting feature representations are input to a spatiotemporal bottleneck $\Theta$ that compresses the 3 appearance features into a single spatiotemporal feature representation. Our model then predicts the reconstruction target, defined in \ref{eqn:reconstruction_target} in \ref{sec:reconstruction}, using the reconstruction decoder $\Psi$. The loss, defined in \ref{eqn:recon_loss}, is calculated and backpropagated through all networks. The architecture is discussed in more detail in \ref{sec:method_arch}.
  • Figure 3: High-SNR visualization and calculations for $FBD$ and PSNR on Fly-vs-Fly \cite{flyvfly} with synthetic noise. In order to calculate the PSNR of (B), a 'clean' image (A) is needed. (A) is not needed for $FBD$, though (C) is.
  • Figure 4: Qualitative raw-denoised pairs of SAVeD. Qualitative results for SAVeD trained on POCUS (lung health categorization), BUV (breast lesion detection), CFC22 (fish detection, tracking, and counting), and Fluo (cell denoising).
  • Figure 5: Qualitative denoising performance on CFC22. We can see that the fish is easiest to spot as a bright patch after processing with our denoiser. The green box highlights the fish location. Each denoised image zooms in to that green bounding box. The red arrow in the raw frame points to the fish location. Additional example visualizations are in \ref{fig:additional_fish_denoised} in Supp. Mat.
  • ...and 10 more figures
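
The Figure 3 caption notes that PSNR can only be computed when a clean reference frame is available. The sketch below illustrates that dependency using the standard PSNR definition, 10·log10(MAX² / MSE). It is not taken from the SAVeD code base: the function name `psnr`, the 8-bit peak value of 255, and the toy 3×3 frames are all assumptions made for illustration, and FBD is not reproduced here because its definition does not appear in this excerpt.

```python
import math


def psnr(clean, noisy, max_value=255.0):
    """Standard peak signal-to-noise ratio (dB) between two grayscale frames.

    `clean` and `noisy` are 2-D lists (rows of pixel intensities) of equal size.
    `max_value` is the assumed peak intensity (255 for 8-bit imagery).
    """
    clean_flat = [p for row in clean for p in row]
    noisy_flat = [p for row in noisy for p in row]
    if not clean_flat or len(clean_flat) != len(noisy_flat):
        raise ValueError("frames must be non-empty and the same size")

    # Mean squared error against the clean reference: this is the step
    # that cannot be carried out when no clean frame exists.
    mse = sum((c - n) ** 2 for c, n in zip(clean_flat, noisy_flat)) / len(clean_flat)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical toy frames: one bright foreground pixel plus mild noise.
clean_frame = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
noisy_frame = [[10, 0, 5], [0, 245, 0], [5, 0, 10]]
score = psnr(clean_frame, noisy_frame)
print(f"PSNR: {score:.2f} dB")
```

Because `clean` is a required argument, this kind of metric cannot be evaluated on sonar, ultrasound, or microscopy footage for which no clean counterpart exists, which is the gap the caption says FBD is meant to address.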