Low-Light Video Enhancement with An Effective Spatial-Temporal Decomposition Paradigm
Xiaogang Xu, Kun Zhou, Tao Hu, Jiafei Wu, Ruixing Wang, Hao Peng, Bei Yu
TL;DR
This work tackles the challenge of preserving temporal and spatial consistency in low-light video enhancement by introducing a view-aware decomposition that separates frame content into a view-dependent term and a view-independent term. The core method, VLLVE, leverages cross-frame correspondences and a Cross-Frame Interaction Module to enforce consistent decomposition with minimal parameter overhead, and it introduces a dual-training regime for joint enhancement and correspondence supervision. Building on this, VLLVE++ adds a residual component to model scene-adaptive degradations and a correspondence refinement network that enables bidirectional learning between enhancement and correspondence quality, yielding stronger frame-level and video-level results. The approach achieves state-of-the-art performance across multiple LLVE benchmarks and real-world datasets, while remaining computationally efficient and adaptable to various backbone architectures, with demonstrated benefits for tasks like NeRF that rely on high-quality, consistent imagery.
Abstract
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes degraded by poor visibility and heavy noise. In this paper, we present a video decomposition strategy that incorporates view-independent and view-dependent components to improve LLVE performance. The framework is called View-aware Low-light Video Enhancement (VLLVE). We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism integrates with encoder-decoder single-frame networks at minimal additional parameter cost. Building upon VLLVE, we propose a more comprehensive decomposition strategy by introducing an additive residual term, resulting in VLLVE++. This residual term can model scene-adaptive degradations, which are difficult to model with a decomposition formulation for common scenes, thereby improving the ability to capture the overall content of videos. In addition, VLLVE++ enables end-to-end bidirectional learning between enhancement and degradation-aware correspondence refinement, effectively increasing reliable correspondences while filtering out incorrect ones. Notably, VLLVE++ demonstrates strong capability in handling challenging cases, such as real-world scenes and videos with high dynamics. Extensive experiments are conducted on widely recognized LLVE benchmarks.
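
The decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Retinex-style multiplicative form (frame = view-independent term × view-dependent term, plus the additive residual of VLLVE++), which the abstract does not state explicitly. The function names `compose`, `consistency_loss`, and `continuity_loss`, the flattened-list frame representation, and the `match` index array are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional

Frame = List[float]  # flattened grayscale frame, illustrative only


def compose(view_independent: Frame, view_dependent: Frame,
            residual: Optional[Frame] = None) -> Frame:
    """Recompose a frame as R * L (+ residual), pixel by pixel.

    The multiplicative form is an assumption; the abstract only names the terms.
    """
    if residual is None:
        residual = [0.0] * len(view_independent)
    return [r * l + e
            for r, l, e in zip(view_independent, view_dependent, residual)]


def consistency_loss(r_ref: Frame, r_other: Frame, match: List[int]) -> float:
    """Mean absolute difference of the view-independent term at matched pixels.

    match[i] is the index in r_other that corresponds to pixel i of r_ref,
    or -1 when no reliable correspondence exists (those pixels are skipped).
    """
    diffs = [abs(r_ref[i] - r_other[j]) for i, j in enumerate(match) if j >= 0]
    return sum(diffs) / len(diffs) if diffs else 0.0


def continuity_loss(l_ref: Frame, l_other: Frame) -> float:
    """Mean absolute change of the view-dependent term between two frames."""
    return sum(abs(a - b) for a, b in zip(l_ref, l_other)) / len(l_ref)
```

In this sketch, filtering out an incorrect correspondence amounts to setting its entry in `match` to -1, so that pixel no longer contributes to the consistency term. A learned refinement network, as in VLLVE++, would presumably produce soft confidences rather than a hard mask.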
