EvLight++: Low-Light Video Enhancement with an Event Camera: A Large-Scale Real-World Dataset, Novel Method, and More
Kanghao Chen, Guoqiang Liang, Hangyu Li, Yunfan Lu, Lin Wang
TL;DR
This work tackles the lack of large-scale, real-world paired event–video data for low-light video enhancement by introducing the SDE dataset, captured with a robotic alignment system that achieves precise spatial ($\sim$0.03 mm) and temporal ($<0.01$ s in 90% of cases) alignment between low-light and normal-light sequences. It then presents EvLight++, an event-guided low-light enhancement (LLE) framework that fuses image and high-dynamic-range (HDR) event information through SNR-guided regional feature selection and a holistic-regional fusion branch, augmented with a convGRU-based temporal module and a temporal loss that enforces illumination consistency. The approach outperforms both frame-based and prior event-guided methods on the real SDE data and the synthetic SDSD data, with substantial gains in PSNR/PSNR* and SSIM, and it improves downstream tasks such as semantic segmentation and monocular depth estimation. The dataset also provides pseudo labels for these downstream tasks, enabling benchmarking and cross-task evaluation of low-light vision pipelines and scene understanding under challenging lighting conditions.
Abstract
Event cameras offer significant advantages for low-light video enhancement, primarily due to their high dynamic range. Current research, however, is severely limited by the absence of large-scale, real-world, and spatio-temporally aligned event-video datasets. To address this, we introduce a large-scale dataset with over 30,000 pairs of frames and events captured under varying illumination. This dataset was curated using a robotic arm that traces a consistent non-linear trajectory, achieving spatial alignment precision under 0.03 mm and temporal alignment with errors under 0.01 s for 90% of the dataset. Based on the dataset, we propose EvLight++, a novel event-guided low-light video enhancement approach designed for robust performance in real-world scenarios. First, we design a multi-scale holistic fusion branch to integrate structural and textural information from both images and events. To counteract variations in regional illumination and noise, we introduce Signal-to-Noise Ratio (SNR)-guided regional feature selection, enhancing features from high-SNR regions and augmenting those from low-SNR regions by extracting structural information from events. To incorporate temporal information and ensure temporal coherence, we further introduce a recurrent module and a temporal loss into the whole pipeline. Extensive experiments on our dataset and the synthetic SDSD dataset demonstrate that EvLight++ significantly outperforms both single image- and video-based methods by 1.37 dB and 3.71 dB, respectively. To further explore its potential in downstream tasks such as semantic segmentation and monocular depth estimation, we extend our dataset by adding pseudo segmentation and depth labels via meticulous annotation efforts with foundation models. Experiments under diverse low-light scenes show that the enhanced results achieve a 15.97% improvement in mIoU for semantic segmentation.
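
The SNR-guided regional selection described above can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the box-blur denoiser, the hard threshold, and the names `box_blur`, `snr_map`, and `select_features` are all assumptions made for illustration. It estimates a per-pixel SNR as the ratio of a denoised image to the estimated noise, then keeps image features where SNR is high and falls back to event features where it is low.

```python
import numpy as np

def box_blur(gray, k=5):
    """Mean filter with edge padding; a stand-in for a real denoiser (assumption)."""
    pad = k // 2
    padded = np.pad(gray, pad, mode="edge")
    out = np.zeros_like(gray, dtype=np.float64)
    h, w = gray.shape
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def snr_map(image, k=5, eps=1e-6):
    """Per-pixel SNR estimate: denoised signal over absolute residual noise."""
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    gray = gray.astype(np.float64)
    denoised = box_blur(gray, k)
    noise = np.abs(gray - denoised)
    return denoised / (noise + eps)

def select_features(img_feat, evt_feat, snr, threshold=1.0):
    """Hypothetical hard selection: image features where SNR is high,
    event features elsewhere. Features are (H, W, C); snr is (H, W)."""
    mask = (snr > threshold).astype(img_feat.dtype)[..., None]
    return mask * img_feat + (1.0 - mask) * evt_feat
```

A learned, soft weighting would be a more realistic choice than the hard mask used here; the threshold of 1.0 is arbitrary and only serves to make the selection behaviour visible.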
