Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras

Hoonhee Cho, Jae-young Kang, Youngho Kim, Kuk-Jin Yoon

TL;DR

This work tackles latency in 3D object detection for autonomous driving by introducing asynchronous event cameras into a multi-modal framework. Ev-3DOD runs RGB-LiDAR detection at each active timestamp and then uses Virtual 3D Event Fusion to estimate 3D motion during blind times from high-temporal-resolution event data; a Motion Confidence Estimator modulates the detection scores. The authors also release two datasets, Ev-Waymo and DSEC-3DOD, with 100 FPS ground truth for evaluating event-based 3D detectors. Experiments show that Ev-3DOD significantly improves online performance during blind times and approaches offline oracle performance while keeping inference fast, especially with the lighter Ev-3DOD-Small variant. Overall, the work demonstrates the practical viability of neuromorphic cameras for high-temporal-resolution 3D perception in dynamic driving scenarios.

Abstract

Detecting 3D objects in point clouds plays a crucial role in autonomous driving systems. Recently, advanced multi-modal methods incorporating camera information have achieved notable performance. For a safe and effective autonomous driving system, algorithms that excel not only in accuracy but also in speed and low latency are essential. However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. We leverage their high temporal resolution and low bandwidth to enable high-speed 3D object detection. Our method enables detection even during inter-frame intervals when synchronized data is unavailable, by retrieving previous 3D information through the event camera. Furthermore, we introduce the first event-based 3D object detection dataset, DSEC-3DOD, which includes ground-truth 3D bounding boxes at 100 FPS, establishing the first benchmark for event-based 3D detectors. The code and dataset are available at https://github.com/mickeykang16/Ev3DOD.
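
To make the inter-frame interval concrete, the following minimal sketch counts the label timestamps that fall strictly between two consecutive synchronized frames. The 100 FPS label rate comes from the abstract; the 10 Hz sensor rate and the helper name `blind_timestamps` are assumptions chosen for illustration, not values taken from the paper.

```python
from fractions import Fraction


def blind_timestamps(sensor_hz: int, label_hz: int) -> list[Fraction]:
    """Return normalized times t in (0, 1) that have labels but no synchronized frame.

    t = 0 and t = 1 are the two neighboring active timestamps.
    """
    if label_hz % sensor_hz != 0:
        raise ValueError("label rate must be an integer multiple of the sensor rate")
    steps = label_hz // sensor_hz
    # k = 0 and k = steps coincide with active timestamps, so they are excluded.
    return [Fraction(k, steps) for k in range(1, steps)]


# Assumed 10 Hz LiDAR/camera against the 100 FPS ground truth.
ts = blind_timestamps(sensor_hz=10, label_hz=100)
print(len(ts), [float(t) for t in ts])
```

Under these assumed rates, nine out of every ten labeled timestamps have no synchronized LiDAR or image data, which is the gap the event camera is meant to cover.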

Paper Structure

This paper contains 23 sections, 5 equations, 16 figures, and 10 tables.

Figures (16)

  • Figure 1: Comparison between the proposed Ev-3DOD and conventional 3D detection methods. Fixed frame rate sensors (e.g., LiDAR, frame camera) inevitably have periods of "blind time" ($t_{0\rightarrow1}$) when they lack data between active timestamps (0 and 1). Consequently, as shown in (a), conventional methods cannot detect objects during blind time, even though objects such as vehicles can undergo significant motion within these gaps. In contrast, as illustrated in (b), our method, Ev-3DOD, leverages the high temporal resolution of the event camera, enabling accurate object detection even during the blind times of other sensors.
  • Figure 2: The overall pipeline for event-based 3D object detection (Ev-3DOD). (a): At active timestamp 0, LiDAR and image data are available. Therefore, we utilize an RGB-LiDAR Region Proposal Network (RPN) to extract voxel features and 3D bounding boxes at the active timestamp. (b): To predict the bounding box during blind time, $0\leq t<1$, we estimate the 3D motion and confidence score using event features. For computational efficiency, we design the process to compute (a) only once before the next active timestamp while performing iterative computations solely for (b).
  • Figure 3: Virtual 3D Event Fusion (V3D-EF). We generate an implicit motion field for each bounding box by applying the ROI pooling separately to the virtual 3D event features, which represent the 3D projection of events, and the voxel features.
  • Figure 4: DSEC 3D Object Detection Dataset samples. Event data, label overlaid image, and accumulated LiDAR from left to right.
  • Figure 5: Qualitative comparisons of our method with other offline and online evaluations on the Ev-Waymo dataset. $t=0$ represents the active time, while $t = 0.2, 0.4, 0.6, 0.8$ denote the blind times. The blue box represents the ground truth, while the red box shows the prediction results of each method. For easier understanding, images at active timestamps 0 and 1 are overlaid.
  • ...and 11 more figures
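
The two-stage schedule described in the Figure 2 caption (run the RGB-LiDAR stage once per active timestamp, then iterate only the event-based stage during blind time) can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: `Box3D`, `detect_with_blind_time`, `toy_rpn`, and `toy_estimate` are hypothetical names, the motion is assumed to be an offset from the active-timestamp box, and multiplying the score by the confidence is an assumed form of score modulation.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Box3D:
    """Hypothetical box: center coordinates plus a detection score."""
    x: float
    y: float
    z: float
    score: float


Motion = Tuple[float, float, float]  # assumed (dx, dy, dz) offset from the active-timestamp box


def detect_with_blind_time(
    run_rpn: Callable[[], List[Box3D]],
    estimate: Callable[[Box3D, object], Tuple[Motion, float]],
    event_slices: Iterable[object],
) -> Iterator[List[Box3D]]:
    """Yield boxes at the active timestamp, then once per blind-time event slice."""
    boxes0 = run_rpn()  # stage (a): expensive, computed once per active timestamp
    yield boxes0
    for events in event_slices:  # stage (b): lightweight, repeated during blind time
        updated = []
        for box in boxes0:
            (dx, dy, dz), conf = estimate(box, events)
            updated.append(
                replace(box, x=box.x + dx, y=box.y + dy, z=box.z + dz,
                        score=box.score * conf)  # assumed score modulation
            )
        yield updated


# Toy stand-ins so the sketch runs; a real system would use learned networks.
calls = {"rpn": 0}


def toy_rpn() -> List[Box3D]:
    calls["rpn"] += 1
    return [Box3D(10.0, 0.0, 0.0, 0.8)]


def toy_estimate(box: Box3D, t: float) -> Tuple[Motion, float]:
    # Here the "event slice" is just the normalized time t; motion grows
    # linearly along x and confidence decays with t.
    return (2.0 * t, 0.0, 0.0), 1.0 - 0.5 * t


frames = list(detect_with_blind_time(toy_rpn, toy_estimate, [0.25, 0.5, 0.75]))
for boxes in frames:
    print(boxes[0])
```

The point of the structure is that `run_rpn` is called once no matter how many event slices arrive, so the per-slice cost during blind time is limited to the event-based update.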