
LLVD: LSTM-based Explicit Motion Modeling in Latent Space for Blind Video Denoising

Loay Rashid, Siddharth Roheda, Amit Unde

TL;DR

LLVD tackles blind video denoising for noise introduced during capture by embedding a 2-layer ConvLSTM within the encoded latent space of a U-Net–style architecture. This latent-space temporal modeling enables efficient, on-device denoising for both RAW and RGB data, with two variants (LLVD-S and LLVD-L) that reduce computational load by up to roughly 75% relative to full architectures. The method achieves state-of-the-art performance on RAW denoising, with a gain of roughly 0.3 dB, and generalizes well across real and synthetic noise datasets while requiring substantially fewer GFLOPs. Ablation studies confirm the critical role of temporal modeling in the latent space and the benefit of a compact encoder–decoder design for lightweight, flicker-free video restoration on resource-constrained devices.
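To put the roughly 0.3 dB figure in perspective, the short sketch below uses the standard PSNR definition to convert a PSNR gain into the corresponding reduction in mean squared error. The function names and the peak value of 1.0 are illustrative assumptions, not taken from the paper.

```python
import math


def psnr(mse: float, peak: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A +0.3 dB PSNR gain corresponds to an MSE about 6.7% lower.
ratio = mse_ratio_for_gain(0.3)
print(f"MSE ratio for +0.3 dB: {ratio:.3f}")
```

Because PSNR is logarithmic, the same 0.3 dB step always maps to the same relative MSE reduction, whatever the baseline PSNR is.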

Abstract

Video restoration plays a pivotal role in recovering degraded video content by correcting imperfections introduced during capture (sensor noise, motion blur, etc.), saving or sharing (compression, resizing, etc.), and editing. This paper introduces a novel algorithm for scenarios where noise is introduced during video capture, aiming to improve visual quality by reducing noise artifacts. We propose the Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising model. LLVD combines spatial and temporal feature extraction by employing Long Short-Term Memory (LSTM) layers within the encoded feature domain. These LSTM layers are crucial for maintaining temporal continuity and minimizing flicker in the restored video. Moreover, processing frames in the encoded feature domain significantly reduces computation, resulting in a very lightweight architecture. Because LLVD is blind, it suits real, in-the-wild denoising scenarios where no prior information about the noise characteristics is available. Experiments show that LLVD performs well on both synthetic and captured noise. Specifically, LLVD surpasses the current state of the art (SOTA) in RAW denoising by 0.3 dB while also achieving a 59% reduction in computational complexity.
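The claim that working in the encoded feature domain reduces computation can be illustrated with simple arithmetic: the cost of a convolution scales with the number of spatial positions, so a layer applied after downsampling is proportionally cheaper. The sketch below is a rough illustration only. The frame size, channel count, kernel size, and the stride of 2 per stage are hypothetical values, and the three downsampling stages follow the Figure 4 description rather than any published configuration.

```python
def conv_macs(height: int, width: int, c_in: int, c_out: int, kernel: int = 3) -> int:
    """Multiply-accumulates for one stride-1, same-padded convolution layer."""
    return height * width * c_in * c_out * kernel * kernel


def latent_size(height: int, width: int, stages: int = 3, stride: int = 2) -> tuple:
    """Spatial size after `stages` strided downsampling steps (assumed stride)."""
    factor = stride ** stages
    return height // factor, width // factor


H, W = 1080, 1920   # hypothetical input frame size
C = 64              # hypothetical channel count, held fixed for comparison

h, w = latent_size(H, W)            # spatial size in the latent space
full = conv_macs(H, W, C, C)        # same layer applied at full resolution
latent = conv_macs(h, w, C, C)      # same layer applied in the latent space

print(f"latent size: {h}x{w}, cost ratio: {full // latent}x")
```

In practice, U-Net–style encoders usually widen the channel dimension as resolution drops, so the real saving for a temporal module in the latent space would be smaller than this fixed-channel comparison suggests.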

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of PSNR (dB) and computational complexity (GFLOPs) of models on the sRGB CRVD testset. Compared to existing methods, our models (LLVD-S/L) achieve State-Of-The-Art denoising performance with significantly lower complexity.
  • Figure 2: Sequential flow and gating mechanisms in an LSTM layer.
  • Figure 3: An overview of our proposed method.
  • Figure 4: Detailed overview of our model architecture for a single frame. We follow a U-Net style symmetric Encoder-Decoder architecture with skip connections. The encoder and decoder can be broken down into three stages each, with each stage consisting of 5 convolutional layers. The last (first) layer of each stage in the encoder (decoder) performs downsampling (upsampling) using strided convolutions (transposed convolutions). The input video frame and denoised output frame are also connected by a residual connection (not shown).
  • Figure 5: Qualitative comparison on scenes in the CRVD and Set8 testsets. Zoom in for better observation.
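The sequential flow and gating that Figure 2 refers to can be sketched with a generic, scalar LSTM step. This is the textbook LSTM formulation, not the paper's exact layer: in a convolutional LSTM the scalar products below would be convolutions over feature maps, and the parameter dictionary `p` and its values are purely hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x: float, h_prev: float, c_prev: float, p: dict) -> tuple:
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name: str) -> float:
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new content to write
    f = sigmoid(pre("f"))      # forget gate: how much old cell state to keep
    o = sigmoid(pre("o"))      # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))    # candidate cell content
    c = f * c_prev + i * g     # updated cell state
    h = o * math.tanh(c)       # updated hidden state
    return h, c


# Hypothetical parameters, chosen only to make the example run.
params = {"i": (0.5, 0.1, 0.0), "f": (0.2, 0.1, 1.0),
          "o": (0.3, 0.1, 0.0), "g": (0.8, 0.2, 0.0)}

h, c = 0.0, 0.0
for frame_feature in [0.9, 1.1, 1.0]:   # stand-in for per-frame latent features
    h, c = lstm_step(frame_feature, h, c, params)
```

The cell state `c` is carried from one frame to the next, which is the mechanism that lets a recurrent layer pass information between consecutive frames.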