Low-Light Video Enhancement with An Effective Spatial-Temporal Decomposition Paradigm

Xiaogang Xu, Kun Zhou, Tao Hu, Jiafei Wu, Ruixing Wang, Hao Peng, Bei Yu

TL;DR

This work tackles the challenge of preserving temporal and spatial consistency in low-light video enhancement by introducing a view-aware decomposition that separates frame content into a view-dependent term and a view-independent term. The core method, VLLVE, leverages cross-frame correspondences and a Cross-Frame Interaction Module to enforce consistent decomposition with minimal parameter overhead, and it introduces a dual-training regime for joint enhancement and correspondence supervision. Building on this, VLLVE++ adds a residual component to model scene-adaptive degradations and a correspondence refinement network that enables bidirectional learning between enhancement and correspondence quality, yielding stronger frame-level and video-level results. The approach achieves state-of-the-art performance across multiple LLVE benchmarks and real-world datasets, while remaining computationally efficient and adaptable to various backbone architectures, with demonstrated benefits for tasks like NeRF that rely on high-quality, consistent imagery.

Abstract

Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. The framework is called View-aware Low-light Video Enhancement (VLLVE). We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism integrates seamlessly with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Building upon VLLVE, we propose a more comprehensive decomposition strategy by introducing an additive residual term, resulting in VLLVE++. This residual term can simulate scene-adaptive degradations, which are difficult to model using a decomposition formulation for common scenes, thereby further enhancing the ability to capture the overall content of videos. In addition, VLLVE++ enables end-to-end bidirectional learning for both enhancement and degradation-aware correspondence refinement, effectively increasing reliable correspondences while filtering out incorrect ones. Notably, VLLVE++ demonstrates strong capability in handling challenging cases, such as real-world scenes and videos with high dynamics. Extensive experiments are conducted on widely recognized LLVE benchmarks.
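
As a reading aid, the two decomposition strategies described above can be written schematically. This is only a sketch under stated assumptions: the abstract does not give the formula, so the element-wise (Retinex-style) product $\odot$ is an assumption, and the symbols $\hat{\boldsymbol{I}}_t$ (enhanced frame $t$) and $\boldsymbol{E}_t$ (additive residual) are hypothetical names introduced here. $\boldsymbol{R}_t$ and $\boldsymbol{L}_t$ follow the notation of the Figure 2 caption for the view-independent and view-dependent terms.

$$
\hat{\boldsymbol{I}}_t = \boldsymbol{R}_t \odot \boldsymbol{L}_t \quad \text{(VLLVE, assumed form)}, \qquad
\hat{\boldsymbol{I}}_t = \boldsymbol{R}_t \odot \boldsymbol{L}_t + \boldsymbol{E}_t \quad \text{(VLLVE++, assumed form)}
$$

Under this reading, $\boldsymbol{R}_t$ is constrained through cross-frame correspondences, $\boldsymbol{L}_t$ through a scene-level continuity constraint, and $\boldsymbol{E}_t$ absorbs scene-adaptive degradations that the product alone does not capture.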

Paper Structure (32 sections, 19 equations, 10 figures, 17 tables, 2 algorithms)

Figures (10)

  • Figure 1: Our proposed LLVE method consistently achieves SOTA performance on different LLVE datasets involving various scenes with the same network architecture.
  • Figure 2: Our framework offers a comprehensive solution that explicitly and consistently models the view-independent and view-dependent decomposition of enhanced normal-light outputs across different frames. To achieve this, we enforce consistent features in the view-independent terms across different frames by leveraging computed correspondences in the temporal dimension of videos ($\mathcal{O}(\boldsymbol{R}_t)$). Simultaneously, we ensure that the view-dependent terms exhibit a spatially continuous distribution ($\mathcal{O}(\boldsymbol{L}_t)$), aligning with real-world scenarios. Furthermore, our network incorporates cross-frame interaction and simultaneous supervision of different frames within a video, encouraging consistent features for these frames derived from one video. For a more detailed visual representation, please refer to \ref{fig:framework2}.
  • Figure 3: The lightweight Cross-Frame Interaction Mechanism (CFIM) propagates features between frames using cross-frame attention and spatial-channel fusion. Cross-frame interaction can be applied in the deep feature space of arbitrary single-image encoder-decoder frameworks; we choose U-Net here.
  • Figure 4: An overview of VLLVE++ with the new decomposition strategy, including the network structure and the training loss terms.
  • Figure 5: Visualization of incorrect correspondences predicted by the pretrained estimator DKM (image from the LLVE test set). Although pretrained models have been optimized for generalization, they inevitably generate some erroneous correspondences because they are not fine-tuned on LLVE. These mismatched points often exhibit markedly different values. To address this, we refine the correspondences by aligning matched points to a view-independent intrinsic value. While RGB values can also be used for alignment, their effect is weaker (see \ref{comparison-vertify}). For implementation, we introduce a correspondence refinement network trained jointly with the LLVE model. For illustration, we visualize three representative points from the predicted correspondences here.
  • ...and 5 more figures
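
The captions for Figures 2 and 5 describe three ideas: matched pixels should agree in the view-independent term, the view-dependent term should vary smoothly in space, and mismatched correspondences can be spotted because their view-independent values disagree. The pure-Python sketch below illustrates these ideas on tiny single-channel maps. It is not the authors' implementation: all function names, the L1 penalties, and the fixed threshold `tol` are assumptions made for illustration, and the threshold filter is only a hand-written stand-in for the learned correspondence refinement network mentioned in the Figure 5 caption.

```python
from typing import List, Tuple

Grid = List[List[float]]                          # single-channel H x W map
Match = Tuple[Tuple[int, int], Tuple[int, int]]   # ((y, x) in frame t, (y, x) in frame s)


def correspondence_consistency(r_t: Grid, r_s: Grid, matches: List[Match]) -> float:
    """Mean absolute difference of the view-independent term at matched pixels.

    Hypothetical L1 form; the source does not specify the exact penalty.
    """
    if not matches:
        return 0.0
    total = 0.0
    for (yt, xt), (ys, xs) in matches:
        total += abs(r_t[yt][xt] - r_s[ys][xs])
    return total / len(matches)


def spatial_continuity(l_t: Grid) -> float:
    """Mean absolute difference between horizontally and vertically adjacent pixels.

    A total-variation-style stand-in (assumption) for a continuity constraint
    on the view-dependent term.
    """
    h, w = len(l_t), len(l_t[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                total += abs(l_t[y][x + 1] - l_t[y][x])
                count += 1
            if y + 1 < h:
                total += abs(l_t[y + 1][x] - l_t[y][x])
                count += 1
    return total / count if count else 0.0


def filter_matches(r_t: Grid, r_s: Grid, matches: List[Match], tol: float) -> List[Match]:
    """Keep matches whose view-independent values differ by at most `tol`.

    Hand-written threshold rule (assumption), not the learned refinement network.
    """
    return [
        m for m in matches
        if abs(r_t[m[0][0]][m[0][1]] - r_s[m[1][0]][m[1][1]]) <= tol
    ]


if __name__ == "__main__":
    r_t = [[0.2, 0.4], [0.6, 0.8]]
    r_s = [[0.4, 0.2], [0.8, 0.6]]   # frame s: columns of frame t swapped
    matches = [((0, 0), (0, 1)), ((1, 1), (1, 0)), ((0, 1), (0, 1))]  # last one is wrong
    print(correspondence_consistency(r_t, r_s, matches))
    print(filter_matches(r_t, r_s, matches, tol=0.1))
    print(spatial_continuity([[0.0, 1.0], [0.0, 1.0]]))
```

In this toy setup the third match pairs pixels whose view-independent values differ, so it raises the consistency penalty and is dropped by the filter, while the two correct matches are kept.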