Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement
Ruirui Lin, Guoxi Huang, Nantheera Anantrasirichai
TL;DR
This work tackles low-light video enhancement with DWTA-Net, which jointly exploits short-term frame alignment and long-term temporal information. The framework has two stages: Stage I uses Visual State-Space blocks for multi-frame enhancement, and Stage II applies a motion-guided dynamic recurrent refinement governed by a residual-based weight map $\omega$ and optical-flow warps. A texture-adaptive loss uses high-frequency texture cues from a 2D-DWT to balance detail preservation against smoothing: it is a weighted combination of pixel, perceptual, and smoothing losses, controlled by a texture map $M_T$. On the DID dataset and challenging in-the-wild sequences, DWTA-Net achieves state-of-the-art PSNR and perceptual quality while maintaining temporal consistency, and ablations confirm that both stages and the texture-aware loss are needed for peak performance.
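To make the idea of a texture-controlled loss concrete, the following is a minimal NumPy sketch, not the authors' formulation. It assumes a one-level Haar transform as the 2D-DWT, a texture map $M_T$ obtained by normalising the high-frequency magnitude to [0, 1], an L1 pixel term applied everywhere, and a gradient-based smoothing term gated by $1 - M_T$. The perceptual term is omitted because it needs a pretrained network. The function names, the weight `lam`, and the exact weighting scheme are all hypothetical.

```python
import numpy as np

def haar_highfreq(img):
    """One-level 2D Haar DWT of a grayscale image with even height and width.

    Returns the combined magnitude of the three high-frequency sub-bands.
    """
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    lh = (a + b - c - d) / 2.0   # vertical detail
    hl = (a - b + c - d) / 2.0   # horizontal detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return np.sqrt(lh**2 + hl**2 + hh**2)

def texture_map(ref, eps=1e-8):
    """Assumed texture map M_T in [0, 1]: normalised high-frequency magnitude,
    upsampled back to the input resolution by pixel repetition."""
    hf = haar_highfreq(ref)
    m = hf / (hf.max() + eps)
    return np.repeat(np.repeat(m, 2, axis=0), 2, axis=1)

def texture_adaptive_loss(pred, target, lam=0.1):
    """Assumed combination: L1 pixel loss plus smoothing applied only where
    the texture map marks the target as flat."""
    m_t = texture_map(target)
    pixel = np.abs(pred - target)
    # Absolute forward differences; the last row/column is padded so the
    # result keeps the shape of `pred`.
    dx = np.abs(np.diff(pred, axis=1, append=pred[:, -1:]))
    dy = np.abs(np.diff(pred, axis=0, append=pred[-1:, :]))
    smooth = dx + dy
    return float(np.mean(pixel) + lam * np.mean((1.0 - m_t) * smooth))
```

Under these assumptions, a noisy prediction of a flat target is penalised by both terms, while high-frequency content in regions that the target marks as textured incurs almost no smoothing penalty.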
Abstract
Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.
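The Stage II aggregation step can be illustrated with a small NumPy sketch. This is an assumption-laden illustration rather than the paper's module: it works on grayscale frames, warps the previous refined frame with nearest-neighbour sampling, and derives the weight map $\omega$ from the residual between the current frame and the warped history through an exponential with a hypothetical temperature `tau`. Where the residual is small (static or well-aligned regions), the history dominates; where it is large (motion or misalignment), the current frame is kept.

```python
import numpy as np

def warp_nearest(prev, flow):
    """Backward-warp `prev` to the current frame with nearest-neighbour sampling.

    flow[y, x] = (dx, dy) points from a current-frame pixel to its source
    location in `prev`; out-of-range coordinates are clamped to the border.
    """
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev[src_y, src_x]

def aggregate(cur, prev_refined, flow, tau=0.1):
    """Blend the current frame with the flow-warped previous result.

    The weight map omega is an assumed function of the residual:
    omega = exp(-|cur - warped| / tau), so omega is 1 where the frames
    agree and approaches 0 where they differ strongly.
    """
    warped = warp_nearest(prev_refined, flow)
    residual = np.abs(cur - warped)
    omega = np.exp(-residual / tau)
    refined = omega * warped + (1.0 - omega) * cur
    return refined, omega
```

In a recurrent setting, `refined` would be passed back in as `prev_refined` for the next frame, which is how long-term information accumulates in static regions under this sketch.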
