EvLight++: Low-Light Video Enhancement with an Event Camera: A Large-Scale Real-World Dataset, Novel Method, and More

Kanghao Chen; Guoqiang Liang; Hangyu Li; Yunfan Lu; Lin Wang

TL;DR

This work tackles the lack of large-scale, real-world paired event–video data for low-light video enhancement by introducing the SDE dataset, captured with a robotic alignment system that achieves precise spatial ($\sim$0.03 mm) and temporal (error $<0.01$ s in 90% of cases) alignment between low-light and normal-light sequences. It then presents EvLight++, an event-guided low-light enhancement (LLE) framework that fuses image information with high-dynamic-range (HDR) event information through SNR-guided regional feature selection and a holistic-regional fusion branch, augmented with a ConvGRU-based temporal module and a temporal loss that enforces illumination consistency across frames. The approach outperforms both frame-based and prior event-guided methods on the real SDE data and the synthetic SDSD data, with substantial gains in PSNR/PSNR* and SSIM, and it improves downstream tasks such as semantic segmentation and monocular depth estimation. The dataset also provides pseudo labels for these downstream tasks, enabling benchmarking and cross-task evaluation of low-light vision pipelines.

Abstract

Event cameras offer significant advantages for low-light video enhancement, primarily due to their high dynamic range. Current research, however, is severely limited by the absence of large-scale, real-world, and spatio-temporally aligned event-video datasets. To address this, we introduce a large-scale dataset with over 30,000 pairs of frames and events captured under varying illumination. This dataset was curated using a robotic arm that traces a consistent non-linear trajectory, achieving spatial alignment precision under 0.03 mm and temporal alignment with errors under 0.01 s for 90% of the dataset. Based on the dataset, we propose EvLight++, a novel event-guided low-light video enhancement approach designed for robust performance in real-world scenarios. First, we design a multi-scale holistic fusion branch to integrate structural and textural information from both images and events. To counteract variations in regional illumination and noise, we introduce Signal-to-Noise Ratio (SNR)-guided regional feature selection, which enhances features from high-SNR regions and augments those from low-SNR regions by extracting structural information from events. To incorporate temporal information and ensure temporal coherence, we further introduce a recurrent module and a temporal loss into the whole pipeline. Extensive experiments on our dataset and the synthetic SDSD dataset demonstrate that EvLight++ significantly outperforms both single-image-based and video-based methods by 1.37 dB and 3.71 dB, respectively. To further explore its potential in downstream tasks such as semantic segmentation and monocular depth estimation, we extend our datasets by adding pseudo segmentation and depth labels via meticulous annotation efforts with foundation models. Experiments under diverse low-light scenes show that the enhanced results achieve a 15.97% improvement in mIoU for semantic segmentation.
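
To make the idea of SNR-guided regional selection more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the SNR map is estimated from the low-light frame alone, by treating a mean-filtered copy as the signal and the absolute residual as the noise, and it replaces the learned feature enhancement with a hard mask that keeps image features where SNR is high and event features elsewhere. The function names `box_blur`, `snr_map`, and `fuse_by_snr`, the window size `k`, and the `threshold` value are all hypothetical.

```python
import numpy as np

def box_blur(gray, k=3):
    """Mean filter with a k x k window and edge padding (k must be odd)."""
    pad = k // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum the k*k shifted copies of the padded image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def snr_map(image, k=3, eps=1e-4):
    """Per-pixel SNR estimate: blurred signal divided by absolute residual.

    Assumption: this blur-and-residual estimate stands in for whatever
    SNR prior the paper uses; it is only an illustration.
    """
    gray = image.mean(axis=2) if image.ndim == 3 else image
    gray = gray.astype(np.float64)
    signal = box_blur(gray, k)
    noise = np.abs(gray - signal)
    return signal / (noise + eps)

def fuse_by_snr(img_feat, evt_feat, snr, threshold=10.0):
    """Keep image features where SNR > threshold, event features elsewhere.

    img_feat and evt_feat have shape (H, W, C); snr has shape (H, W).
    A hard mask is a simplification of learned, soft feature selection.
    """
    mask = (snr > threshold).astype(img_feat.dtype)[..., None]
    return mask * img_feat + (1.0 - mask) * evt_feat

# Toy frame: a clean flat left half and a checkerboard "noisy" right half.
rows, cols = np.indices((16, 16))
frame = np.where(cols < 8, 0.5, np.where((rows + cols) % 2 == 0, 0.1, 0.3))
snr = snr_map(frame)
fused = fuse_by_snr(np.ones((16, 16, 4)), np.zeros((16, 16, 4)), snr)
```

In this toy example the flat region receives a very high SNR estimate and keeps its image features, while the checkerboard region falls below the threshold and is filled from the event branch.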

Paper Structure (34 sections, 10 equations, 18 figures, 10 tables)

Introduction
Related Work
Mechanism System and Our SDE Dataset
Overview of Robotic System
Spatial and Temporal Alignments Pipeline
Data Capture with Spatial Alignment
Temporal Alignment of Low-light/Normal-light sequences
The proposed SDE dataset
Paired Low-light/Normal-light Dataset
Labeling for Downstream Tasks
The Proposed EvLight++ Framework
Preprocessing
SNR-guided Regional Feature Selection
Holistic-Regional Fusion Branch
Optimization
...and 19 more sections

Figures (18)

Figure 1: (a) A challenging example from our dataset containing an extremely low-light image and events. Compared with the result from a SOTA frame-based method, Retinexformer [cai2023retinexformer], our EvLight++ not only recovers structural details (e.g., the pipe on the ceiling) but also avoids over-enhancement and saturation in the bright regions (e.g., the lights). (b&c) Semantic segmentation and depth estimation are conducted on the enhanced outputs with off-the-shelf models.
Figure 2: An illustration of our mechanical system for collecting a spatially aligned image-event dataset: a DAVIS 346 event camera is mounted on a UR5e robotic arm, and sequences are recorded along a pre-defined trajectory. The corresponding low-light sequence is captured by applying an ND8 filter to the camera lens.
Figure 3: (a) An illustration of the variable time interval between the start timestamp of the trajectory and the first frame timestamp in each sequence. (b) An example of the matching alignment strategy.
Figure 4: (a) Distribution of temporal alignment error (measured in seconds) of our dataset. (b) Distribution of video length of our dataset. (c) Illumination distribution in the filming environment.
Figure 5: (a) An example of our dataset with images and paired events captured in low-light (with an ND8 filter) and normal-light conditions. (b) Examples of the downstream annotations of semantic segmentation and depth estimation.
...and 13 more figures
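
The captions of Figures 3 and 4 refer to a matching strategy and a temporal alignment error without giving details here. As a rough illustration only, the sketch below assumes that frame timestamps in both sequences are expressed in seconds relative to the start of the robot trajectory, that each low-light frame is paired with the nearest normal-light frame, and that the reported statistic is the fraction of pairs whose error is below a tolerance. The function name `match_frames`, the example timestamps, and the exact matching rule are hypothetical and may differ from the paper's procedure.

```python
import numpy as np

def match_frames(low_ts, normal_ts):
    """Pair each low-light timestamp with the nearest normal-light timestamp.

    Both inputs are timestamps in seconds relative to the trajectory start
    (an assumption). normal_ts must be sorted and contain at least two values.
    Returns the matched indices into normal_ts and the absolute errors.
    """
    low_ts = np.asarray(low_ts, dtype=np.float64)
    normal_ts = np.asarray(normal_ts, dtype=np.float64)
    # Index of the first normal-light timestamp >= each low-light timestamp,
    # clipped so that both neighbours (idx - 1 and idx) are valid.
    idx = np.clip(np.searchsorted(normal_ts, low_ts), 1, len(normal_ts) - 1)
    left, right = normal_ts[idx - 1], normal_ts[idx]
    nearest = np.where((low_ts - left) <= (right - low_ts), idx - 1, idx)
    errors = np.abs(normal_ts[nearest] - low_ts)
    return nearest, errors

# Made-up timestamps for illustration; not taken from the dataset.
low = [0.000, 0.040, 0.080]
normal = [0.005, 0.046, 0.095, 0.120]
nearest, errors = match_frames(low, normal)

# Fraction of pairs aligned to within 0.01 s.
fraction_within = float(np.mean(errors < 0.01))
```

Using a binary search keeps the matching at O(n log m) for n low-light and m normal-light frames, which matters little for short clips but scales to long sequences.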
