Towards Consistent Object Detection via LiDAR-Camera Synergy

Kai Luo, Hao Wu, Kefu Yi, Kailun Yang, Wei Hao, Rongdong Hu

TL;DR

The paper tackles consistent cross-modal object detection across LiDAR and camera data. It introduces an end-to-end Consistency Object Detection (COD) framework that uses LiDAR proposals to initialize image queries, so a single forward pass yields 3D and 2D detections that share the same object identity. It also proposes the Consistency Precision (CP) metric to quantify cross-modal correspondence. The method pairs a configurable LiDAR detector with an RT-DETR image detector and relies on learnable query initialization, Hungarian-based matching, and a joint training loss. Experiments on the KITTI and DAIR-V2X benchmarks show strong cross-modal consistency and robustness to calibration inaccuracies. The work also provides new benchmarks and offers a practical approach to multimodal perception in driving scenes, with potential applications in human-machine interaction.
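
One way LiDAR proposals could seed image queries is to project each 3D box into the image and use the enclosing 2D box as the query's reference box. The sketch below assumes exactly that. The projection step, the function names, and the 3x4 matrix `P` (camera intrinsics combined with LiDAR-to-camera extrinsics) are illustrative assumptions, not details stated in this summary.

```python
import math

def box_corners(center, size, yaw):
    """Eight corners of a 3D box: center (x, y, z), size (l, w, h), yaw about the z-axis."""
    cx, cy, cz = center
    l, w, h = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-l / 2, l / 2):
        for dy in (-w / 2, w / 2):
            for dz in (-h / 2, h / 2):
                # Rotate the offset in the ground plane, then translate.
                corners.append((cx + c * dx - s * dy, cy + s * dx + c * dy, cz + dz))
    return corners

def project(point, P):
    """Apply a 3x4 projection matrix (assumed given) to a 3D point; returns (u, v, depth)."""
    x, y, z = point
    u, v, d = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    return u / d, v / d, d

def box3d_to_query_box(center, size, yaw, P, img_w, img_h):
    """Normalized (cx, cy, w, h) of the 2D box enclosing the projected corners, or None."""
    pts = [project(c, P) for c in box_corners(center, size, yaw)]
    if any(d <= 0 for _, _, d in pts):
        return None  # at least one corner is behind the camera; skip this proposal
    # Clamp projected corners to the image before taking the enclosing box.
    us = [min(max(u, 0.0), img_w) for u, _, _ in pts]
    vs = [min(max(v, 0.0), img_h) for _, v, _ in pts]
    u0, u1, v0, v1 = min(us), max(us), min(vs), max(vs)
    return ((u0 + u1) / 2 / img_w, (v0 + v1) / 2 / img_h,
            (u1 - u0) / img_w, (v1 - v0) / img_h)

# Toy calibration: LiDAR x forward, y left, z up; focal length 100 px, 100x100 image.
P = [[50.0, -100.0, 0.0, 0.0],
     [50.0, 0.0, -100.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]]
query_box = box3d_to_query_box((10.0, 0.0, 0.0), (2.0, 2.0, 2.0), 0.0, P, 100, 100)
```

Under this assumption, any calibration error enters through `P` and shifts the reference box. The query is therefore best treated as an initial guess that the image decoder refines, not as the final 2D detection.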

Abstract

As human-machine interaction continues to evolve, the capacity for environmental perception is becoming increasingly crucial. Integrating the two most common types of sensory data, images and point clouds, can enhance detection accuracy. Currently, no existing model can detect an object's position in both point clouds and images while also determining the correspondence between the two. This information is invaluable for human-machine interaction and opens new possibilities for enhancing it. In light of this, this paper introduces an end-to-end Consistency Object Detection (COD) framework that requires only a single forward inference to simultaneously obtain an object's position in both point clouds and images and establish their correlation. Furthermore, to assess the accuracy of the object correlation between point clouds and images, this paper proposes a new evaluation metric, Consistency Precision (CP). To verify the effectiveness of the proposed framework, an extensive set of experiments has been conducted on the KITTI and DAIR-V2X datasets. The study also explores how the proposed consistency detection method performs on images when the calibration parameters between images and point clouds are disturbed, compared with existing post-processing methods. The experimental results demonstrate that the proposed method exhibits excellent detection performance and robustness, achieving end-to-end consistency detection. The source code will be made publicly available at https://github.com/xifen523/COD.
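
The abstract does not give the formula for CP, so the snippet below is not the paper's metric. It is only a minimal sketch of what a pair-level precision could look like. It assumes each predicted object has already been matched to a ground-truth ID separately in the point cloud and in the image. The function name and input layout are hypothetical.

```python
def pair_precision(matches):
    """Fraction of predicted objects whose 3D and 2D boxes hit the same ground-truth ID.

    matches: list of (gt_id_3d, gt_id_2d), one tuple per predicted object.
             None means the box matched no ground-truth object in that modality.
    """
    if not matches:
        return 0.0
    consistent = sum(1 for id3d, id2d in matches if id3d is not None and id3d == id2d)
    return consistent / len(matches)

# Four predicted objects: two consistent, one cross-matched, one with no 2D match.
example = [(0, 0), (1, 2), (3, None), (4, 4)]
score = pair_precision(example)  # 0.5
```

A prediction counts as correct here only when both modalities agree on the object, which is the property the abstract says CP is meant to assess.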

Paper Structure

This paper contains 21 sections, 7 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Panels (a) and (c) illustrate the consistency detection task: an object must be localized in both the point cloud and the image, and it must carry the same ID in both modalities. Panel (b) compares the precision of image bounding-box detection on the KITTI dataset for the original method and the consistency detection method (ours), under both noisy and noise-free conditions; the consistency detection method shows enhanced robustness.
  • Figure 2: The architecture diagram for the consistency detection network. The overall architecture of consistency detection comprises two pathways: the point cloud object detection pathway and the image object detection pathway. In the former, point cloud features are extracted through a 3D backbone network, transformed via a neck network, and then object positions and dimensions in the point cloud are predicted using a detection head. In the latter, features are extracted through a 2D backbone network and processed through the encoder layer of a transformer to generate a heat map, from which query proposals are derived. Additional query proposals are obtained using the object positions and dimensions acquired from the point cloud. Both sets of query proposals are fed into the decoder layer of the transformer, and the final object positions in the image are obtained via the image's detection head. Notably, during training, the first set of queries is matched with the ground truth to compute loss, while the second set, already corresponding to the ground truth, bypasses the matching process and goes directly to loss calculation.
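
The training-time handling of the two query sets described in Figure 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The matching cost is a plain L1 distance between boxes. The minimum-cost assignment is found by brute force, which gives the same result as the Hungarian algorithm but only scales to a handful of boxes (`scipy.optimize.linear_sum_assignment` would be the usual choice in practice). `lidar_preds[i]` is assumed to originate from the 3D box of ground-truth object `i`. All function names are hypothetical.

```python
from itertools import permutations

def l1_cost(a, b):
    """L1 distance between two boxes given as equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def optimal_assignment(pred_boxes, gt_boxes):
    """Return {gt_index: pred_index} minimizing the total L1 cost (brute force)."""
    if len(pred_boxes) < len(gt_boxes):
        raise ValueError("need at least as many predictions as ground-truth boxes")
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return {g: p for g, p in enumerate(best)}

def build_targets(heatmap_preds, lidar_preds, gt_boxes):
    """Pair predictions with ground truth for both query sets.

    First set (heat-map queries): paired by minimum-cost matching.
    Second set (LiDAR-derived queries): paired by index, with no matching step.
    """
    matched = optimal_assignment(heatmap_preds, gt_boxes)
    set1 = [(heatmap_preds[p], gt_boxes[g]) for g, p in matched.items()]
    set2 = list(zip(lidar_preds, gt_boxes))
    return set1, set2

def total_loss(pairs):
    """Sum of L1 box losses over (prediction, target) pairs."""
    return sum(l1_cost(pred, gt) for pred, gt in pairs)

gt = [(0.2, 0.2, 0.1, 0.1), (0.8, 0.8, 0.1, 0.1)]
heatmap_preds = [(0.8, 0.8, 0.1, 0.1), (0.2, 0.2, 0.1, 0.1), (0.5, 0.5, 0.3, 0.3)]
lidar_preds = [(0.21, 0.2, 0.1, 0.1), (0.8, 0.79, 0.1, 0.1)]
set1, set2 = build_targets(heatmap_preds, lidar_preds, gt)
loss = total_loss(set1) + total_loss(set2)
```

Because the second set keeps its index alignment with the ground truth, each image box it produces inherits the identity of the 3D box it came from.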