
CoSEC: A Coaxial Stereo Event Camera Dataset for Autonomous Driving

Shihan Peng, Hanyu Zhou, Hao Dong, Zhiwei Shi, Haoyue Liu, Yuxing Duan, Yi Chang, Luxin Yan

TL;DR

CoSEC tackles the challenge of cross-modal alignment in autonomous driving by introducing coaxial, beam-splitter-based event-frame devices that minimize the baseline between the event and frame cameras, enabling pixel-level spatiotemporal fusion. The authors build a coaxial stereo event camera dataset with LiDAR, IMU, and GNSS, and establish time synchronization and calibration pipelines. Ground-truth depth and optical flow are generated through LiDAR-SLAM fusion, with event-frame enhancement for nighttime conditions. The authors report that coaxial alignment improves multimodal fusion performance and generalization, particularly in low-light and nighttime scenes, outperforming parallel-layout datasets. The dataset covers all-day sequences across diverse environments and provides train/test splits to support the development of robust cross-modal fusion methods for 3D dynamic scene perception in autonomous driving.

Abstract

Conventional frame cameras are the mainstream sensor for scene perception in autonomous driving, but they are limited in adverse conditions such as low light. Event cameras, with their high dynamic range, have been used to assist frame cameras in multimodal fusion, which relies heavily on pixel-level spatial alignment between the modalities. Existing multimodal datasets typically place the event and frame cameras in parallel and align them spatially via a warping operation. However, this parallel strategy is less effective for multimodal fusion, because the large event-frame baseline produces a large disparity that exacerbates spatial misalignment. We argue that minimizing the baseline can reduce the alignment error between event and frame cameras. In this work, we introduce hybrid coaxial event-frame devices to build a multimodal system, and propose a coaxial stereo event camera (CoSEC) dataset for autonomous driving. For the multimodal system, we first use a microcontroller to achieve time synchronization, and then spatially calibrate the different sensors, performing intra- and inter-calibration of the stereo coaxial devices. For the multimodal dataset, we filter LiDAR point clouds using reference depth to generate depth and optical flow labels; the reference depth is further improved in nighttime conditions by fusing aligned event and frame data. With the help of the coaxial device, the proposed dataset can promote all-day pixel-level multimodal fusion. Moreover, we conduct experiments to demonstrate that the proposed dataset can improve the performance and generalization of multimodal fusion.
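
The link between baseline and misalignment argued in the abstract can be illustrated with the standard rectified pinhole stereo relation d = f * B / Z. The sketch below is only an illustration under that idealized model: the focal length, baselines, and depths are made-up values, and the helper names `disparity_px` and `residual_misalignment_px` are hypothetical, not taken from the paper. It assumes the two cameras are aligned by one global warp tuned for a single reference depth, so points at other depths keep a residual offset.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity in pixels for an idealized rectified pinhole pair: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def residual_misalignment_px(
    focal_px: float, baseline_m: float, depth_m: float, ref_depth_m: float
) -> float:
    """Offset left at depth_m after a single warp that is exact at ref_depth_m."""
    return abs(
        disparity_px(focal_px, baseline_m, depth_m)
        - disparity_px(focal_px, baseline_m, ref_depth_m)
    )


if __name__ == "__main__":
    focal_px = 1000.0      # assumed focal length in pixels (illustrative)
    ref_depth_m = 50.0     # depth at which the global warp is assumed exact
    near_depth_m = 5.0     # a nearby object, e.g. a vehicle ahead
    # Illustrative baselines: parallel layout, compact layout, coaxial layout.
    for baseline_m in (0.10, 0.02, 0.0):
        err = residual_misalignment_px(focal_px, baseline_m, near_depth_m, ref_depth_m)
        print(f"baseline={baseline_m:.2f} m -> residual misalignment {err:.2f} px")
```

Under these assumptions the residual scales linearly with the baseline and vanishes when the baseline is zero, which is the intuition behind the coaxial design.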

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of the multimodal system and dataset. For the multimodal system, we use a beam splitter to design a coaxial event-frame device for pixel-level spatial alignment, and then build a coaxial stereo multimodal imaging system with LiDAR and an INS. For the multimodal dataset, we fuse the aligned event-image data and the LiDAR point cloud to generate ground-truth depth and optical flow. We use this system to collect the coaxial stereo event camera dataset for autonomous driving.
  • Figure 2: Comparison of event-frame placement strategies. Left: a large baseline leads to a large disparity, resulting in misalignment between the event and frame cameras. Middle: a small baseline yields a small but non-negligible disparity. Right: minimizing the baseline reduces the disparity. We therefore introduce a coaxial strategy to reduce the spatial alignment error between the event and frame cameras.
  • Figure 3: Comparison between parallel and coaxial strategies. The parallel placement strategy introduces spatial alignment errors between the event and frame data in local regions. In contrast, the coaxial strategy improves pixel-level spatial alignment between the event and frame data.
  • Figure 4: Calibration of the stereo coaxial devices. We perform intra-calibration within each coaxial device and inter-calibration between the stereo coaxial devices. During intra-calibration, we first reconstruct events into event frames for standard calibration, and then align the event and image data via a warping operation. During inter-calibration, we apply stereo rectification to obtain paired, rectified event-image data.
  • Figure 5: Pipeline of ground-truth generation. We first fuse the individual point clouds within a time window into a local cloud via SLAM, and then project the fused local cloud into the camera coordinate system to obtain coarse depth. Next, an outlier removal module estimates reference depth from the input image and uses it to filter the coarse depth, yielding ground-truth depth and optical flow. In addition, an event-frame fusion strategy enhances nighttime low-light images to obtain better reference depth, improving the accuracy of ground-truth generation.
  • ...and 2 more figures
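
The projection and filtering steps described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the points are already expressed in the camera coordinate system, uses a plain pinhole model with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), and stands in for the outlier removal module with a simple relative-error threshold (`rel_tol`) against a reference depth map. The function names and the threshold rule are assumptions made for the example.

```python
import math
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]   # (x, y, z) in the camera frame, metres
Pixel = Tuple[int, int]              # (u, v) image coordinates


def project_to_coarse_depth(
    points_cam: Iterable[Point],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> Dict[Pixel, float]:
    """Project camera-frame points to a sparse depth map, keeping the nearest point per pixel."""
    depth: Dict[Pixel, float] = {}
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point lies behind the camera
        u = math.floor(fx * x / z + cx + 0.5)  # round to nearest pixel
        v = math.floor(fy * y / z + cy + 0.5)
        if 0 <= u < width and 0 <= v < height:
            if (u, v) not in depth or z < depth[(u, v)]:
                depth[(u, v)] = z
    return depth


def filter_with_reference(
    coarse: Dict[Pixel, float],
    reference: Dict[Pixel, float],
    rel_tol: float = 0.2,
) -> Dict[Pixel, float]:
    """Keep coarse depths within rel_tol (relative error) of the reference depth.

    The threshold rule is an assumed stand-in for the paper's outlier removal module.
    """
    kept: Dict[Pixel, float] = {}
    for pixel, z in coarse.items():
        z_ref = reference.get(pixel)
        if z_ref is not None and z_ref > 0.0 and abs(z - z_ref) <= rel_tol * z_ref:
            kept[pixel] = z
    return kept


if __name__ == "__main__":
    points = [(0.0, 0.0, 10.0), (0.0, 0.0, 30.0), (1.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
    coarse = project_to_coarse_depth(points, 100.0, 100.0, 50.0, 40.0, 100, 80)
    reference = {(50, 40): 10.5, (60, 40): 25.0}  # made-up reference depths
    print(filter_with_reference(coarse, reference))
```

Keeping the nearest point per pixel is a simple way to handle occlusion in the fused cloud. In this sketch the reference depth only vetoes inconsistent LiDAR points and never replaces a LiDAR measurement.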