CoSEC: A Coaxial Stereo Event Camera Dataset for Autonomous Driving
Shihan Peng, Hanyu Zhou, Hao Dong, Zhiwei Shi, Haoyue Liu, Yuxing Duan, Yi Chang, Luxin Yan
TL;DR
CoSEC tackles the challenge of cross-modal alignment in autonomous driving by introducing coaxial, beam-splitter-based event-frame devices that minimize the baseline between event and frame cameras, enabling pixel-level spatiotemporal fusion. The authors build a coaxial stereo event-camera dataset with LiDAR, IMU, and GNSS, and establish time synchronization and calibration pipelines, generating high-quality ground-truth depth and optical flow through LiDAR-SLAM fusion and event-frame enhancement for nighttime conditions. They demonstrate that coaxial alignment improves multimodal fusion performance and generalization, particularly in low-light and nighttime scenes, outperforming parallel-layout datasets. The dataset covers all-day sequences across diverse environments and provides train/test splits to support robust development of cross-modal fusion methods for 3D dynamic scene perception in autonomous driving.
Abstract
Conventional frame cameras are the mainstream sensors for scene perception in autonomous driving, but they are limited in adverse conditions such as low light. Event cameras, with their high dynamic range, have been used to assist frame cameras through multimodal fusion, which relies heavily on pixel-level spatial alignment between the modalities. Existing multimodal datasets typically place the event and frame cameras in parallel and align them spatially via a warping operation. However, this parallel strategy is less effective for multimodal fusion, because the large event-frame baseline produces large disparity, which exacerbates spatial misalignment. We argue that minimizing the baseline can reduce the alignment error between event and frame cameras. In this work, we introduce hybrid coaxial event-frame devices to build a multimodal system, and propose a coaxial stereo event camera (CoSEC) dataset for autonomous driving. For the multimodal system, we first use a microcontroller to achieve time synchronization, and then spatially calibrate the different sensors, performing both intra- and inter-calibration of the stereo coaxial devices. For the multimodal dataset, we filter LiDAR point clouds to generate depth and optical flow labels using reference depth, which is further improved by fusing aligned event and frame data in nighttime conditions. With the help of the coaxial device, the proposed dataset can promote all-day pixel-level multimodal fusion. We also conduct experiments demonstrating that the proposed dataset can improve the performance and generalization of multimodal fusion.
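The baseline argument can be illustrated with standard pinhole stereo geometry; the sketch below is not taken from the paper. For two parallel cameras with focal length f (in pixels) and baseline B, a point at depth Z appears shifted by a disparity of f·B/Z pixels. A warp that assumes one reference depth therefore leaves a residual error at every other depth, and that error scales with B. The function names, focal length, baselines, and depths are hypothetical values chosen only for illustration.

```python
def parallax_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity in pixels between two parallel pinhole cameras.

    Standard stereo relation d = f * B / Z; all inputs are illustrative.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def residual_misalignment_px(
    focal_px: float, baseline_m: float, depth_m: float, ref_depth_m: float
) -> float:
    """Alignment error left after warping with a single reference depth.

    The warp removes the disparity at ref_depth_m; a point at depth_m keeps
    the difference between its true disparity and the compensated one.
    """
    true_shift = parallax_px(focal_px, baseline_m, depth_m)
    compensated = parallax_px(focal_px, baseline_m, ref_depth_m)
    return abs(true_shift - compensated)


if __name__ == "__main__":
    focal_px = 1000.0      # hypothetical focal length in pixels
    ref_depth_m = 20.0     # hypothetical depth assumed by the warp
    # Hypothetical baselines: a parallel rig versus an ideal coaxial device.
    for label, baseline_m in [("parallel", 0.10), ("coaxial", 0.0)]:
        for depth_m in (5.0, 20.0, 80.0):
            err = residual_misalignment_px(
                focal_px, baseline_m, depth_m, ref_depth_m
            )
            print(f"{label:8s} depth={depth_m:5.1f} m  residual={err:6.2f} px")
```

Under these assumptions the residual vanishes at every depth when the baseline is zero, which is the idealized case a beam-splitter device approximates. A real coaxial device would still need calibration for any remaining intrinsic and mounting differences.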
