Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement

Ruirui Lin, Guoxi Huang, Nantheera Anantrasirichai

TL;DR

This work tackles low-light video enhancement with DWTA-Net, which jointly leverages short-term frame alignment and long-term temporal information. It introduces a two-stage framework: Stage I uses Visual State-Space blocks for multi-frame enhancement, and Stage II applies motion-guided dynamic recurrent refinement, governed by a residual-based weight map $\omega$ and optical-flow warps. A texture-adaptive loss uses high-frequency texture cues from the 2D-DWT to balance detail preservation against smoothing through a weighted combination of pixel, perceptual, and smoothing losses, controlled by a texture map $M_T$. On the DID dataset and challenging in-the-wild sequences, DWTA-Net achieves state-of-the-art PSNR and perceptual quality while maintaining temporal consistency, and ablations confirm that both stages and the texture-aware loss are needed for peak performance.
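To make the idea of a texture map built from 2D-DWT high-frequency cues more concrete, the following is a minimal sketch and not the authors' formulation. It assumes a one-level Haar transform on 2x2 blocks, a texture map normalised by its peak value, and a loss that weights a pixel term by $M_T$ and a smoothing term by $1 - M_T$. The perceptual term is omitted because it needs a pretrained network. All function names are hypothetical.

```python
def haar_detail_energy(img):
    """One-level 2D Haar DWT on non-overlapping 2x2 blocks.

    Returns, per block, the summed magnitude of the three detail
    (high-frequency) coefficients. `img` is a list of rows of floats
    with even height and width.
    """
    energy = []
    for i in range(0, len(img) - 1, 2):
        row = []
        for j in range(0, len(img[0]) - 1, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            top_vs_bottom = (a + b - c - d) / 2.0   # responds to horizontal edges
            left_vs_right = (a - b + c - d) / 2.0   # responds to vertical edges
            diagonal = (a - b - c + d) / 2.0        # responds to diagonal detail
            row.append(abs(top_vs_bottom) + abs(left_vs_right) + abs(diagonal))
        energy.append(row)
    return energy


def texture_map(img):
    """Assumed normalisation: detail energy scaled to [0, 1] by its peak."""
    energy = haar_detail_energy(img)
    peak = max(max(row) for row in energy)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in energy]
    return [[e / peak for e in row] for row in energy]


def texture_adaptive_loss(pred, target):
    """Assumed weighting: M_T * pixel term + (1 - M_T) * smoothing term.

    Textured blocks are pulled towards the target; flat blocks are
    penalised for high-frequency content in the prediction.
    """
    m_t = texture_map(target)
    pred_detail = haar_detail_energy(pred)
    total, count = 0.0, 0
    for bi, row in enumerate(m_t):
        for bj, m in enumerate(row):
            i, j = 2 * bi, 2 * bj
            pixel = sum(
                abs(pred[i + di][j + dj] - target[i + di][j + dj])
                for di in (0, 1) for dj in (0, 1)
            ) / 4.0
            total += m * pixel + (1.0 - m) * pred_detail[bi][bj]
            count += 1
    return total / count


target = [
    [0.5, 0.5, 0.0, 1.0],
    [0.5, 0.5, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]
noisy = [row[:] for row in target]
noisy[2][0] = 0.7  # a noise spike in a flat region
print(texture_map(target))
print(texture_adaptive_loss(target, target), texture_adaptive_loss(noisy, target))
```

In this toy input only the checkerboard block counts as textured, so the noise spike in the flat block is penalised by the smoothing term rather than the pixel term.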

Abstract

Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.
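The dynamic weight-based temporal aggregation described above can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's module: it takes a previous output that has already been warped to the current frame by optical flow (the warp itself is not shown), maps the per-pixel residual to a weight with a Gaussian (an assumed choice), and caps the history contribution with a hypothetical `max_history` parameter.

```python
import math


def residual_weight(current, warped_prev, sigma=0.1):
    """Per-pixel weight in (0, 1]: close to 1 where the current frame and
    the flow-warped previous output agree (static regions), close to 0
    where they differ (motion or warp errors).

    The Gaussian mapping and `sigma` are assumptions for illustration.
    """
    return [
        [math.exp(-((c - p) ** 2) / (2.0 * sigma ** 2)) for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(current, warped_prev)
    ]


def aggregate(current, warped_prev, sigma=0.1, max_history=0.9):
    """Blend the current frame with the warped history using the weight map.

    Static pixels lean on the history (stronger denoising over time);
    dynamic pixels fall back to the current frame (less ghosting).
    """
    omega = residual_weight(current, warped_prev, sigma)
    return [
        [
            (1.0 - max_history * w) * c + max_history * w * p
            for c, p, w in zip(row_c, row_p, row_w)
        ]
        for row_c, row_p, row_w in zip(current, warped_prev, omega)
    ]


# One static pixel (small noisy difference) and one moving pixel (large difference).
current = [[0.52, 0.90]]
warped_prev = [[0.50, 0.10]]
fused = aggregate(current, warped_prev)
print(fused)
```

Applied recurrently, with each fused frame serving as the history for the next, a blend of this kind accumulates information over many frames in static regions, which is the intuition behind favouring recurrence over a fixed-size window.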

Paper Structure

This paper contains 10 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Noise suppression comparison: recurrent aggregation vs. fixed 5-frame averaging. Recurrence leverages long-term information for stronger suppression.
  • Figure 2: Overview of the proposed DWTA-Net. (a) Stage I: multi-frame enhancement for brightness and structure restoration. (b) Stage II: recurrent refinement with dynamic temporal aggregation for long-term consistency.
  • Figure 3: Low-light enhancement comparison using histogram stretching, SDSD-net, Starlight, WaveMamba, and our method.
  • Figure 4: Qualitative comparison of denoising performance (and brightness enhancement) on the static sky.