BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization

Mario A. V. Saucedo, Nikolaos Stathoulopoulos, Vidya Sumathy, Christoforos Kanellakis, George Nikolakopoulos

TL;DR

The paper addresses fast and accurate 3D object detection and localization by fusing RGB camera data with LiDAR. It introduces BOX3D, a three-layer lightweight framework that (1) generates initial 3D bounding boxes from 2D detections projected into the LiDAR frame, (2) merges and refines 3D boxes in the world coordinate frame across time, and (3) refines object geometry on a global LiDAR map via point-to-voxel clustering. The approach leverages YOLOv8n for 2D detection, Euclidean clustering for point-level labeling, and Direct Lidar Odometry for consistent world-frame transforms, achieving a balance between speed and accuracy. Evaluations on KITTI show competitive mean IoU and favorable runtimes, with Layer II as the main computational bottleneck, demonstrating BOX3D’s practical viability for real-time robotics applications. Overall, BOX3D provides robust local and global object localization by integrating sequential observations into a coherent global map while maintaining efficiency through a three-layer design.
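As a rough illustration of the first layer, the sketch below projects LiDAR points into the image plane and keeps those that fall inside a 2D detection box, then fits an axis-aligned 3D box around them. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the pinhole intrinsics `K`, and the extrinsic transform `T_cam_lidar` are hypothetical, and the segmentation mask and Euclidean clustering steps mentioned in the summary are omitted for brevity.

```python
import numpy as np


def project_to_image(points_lidar, T_cam_lidar, K):
    """Project N x 3 LiDAR points to pixel coordinates with a pinhole model.

    Returns (uv, in_front): uv is N x 2, in_front marks points with positive depth.
    T_cam_lidar (4 x 4) and K (3 x 3) are assumed calibration inputs.
    """
    n = points_lidar.shape[0]
    homog = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_lidar @ homog.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.0

    # Pixels for points behind the camera stay at zero and are masked out later.
    uv = np.zeros((n, 2))
    proj = (K @ pts_cam[in_front].T).T
    uv[in_front] = proj[:, :2] / proj[:, 2:3]
    return uv, in_front


def box3d_from_detection(points_lidar, uv, in_front, box2d):
    """Fit an axis-aligned 3D box to the points that project inside box2d.

    box2d is (u_min, v_min, u_max, v_max) in pixels. Returns (min_xyz, max_xyz)
    in the LiDAR frame, or None when no point falls inside the detection.
    """
    u_min, v_min, u_max, v_max = box2d
    inside = (
        in_front
        & (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max)
        & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    )
    if not inside.any():
        return None
    selected = points_lidar[inside]
    return selected.min(axis=0), selected.max(axis=0)
```

In practice a plain 2D box also captures background points behind the object, which is why a mask or a clustering step is needed before fitting the box; the sketch leaves that out.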

Abstract

Object detection and global localization play a crucial role in robotics, spanning a broad spectrum of applications from autonomous cars to multi-layered 3D Scene Graphs for semantic scene understanding. This article proposes BOX3D, a novel multi-modal and lightweight scheme for localizing objects of interest by fusing information from an RGB camera and a 3D LiDAR. BOX3D is structured around a three-layered architecture, building up from the local perception of the incoming sequential sensor data to a global perception refinement that accounts for outliers and the general consistency of each object's observation. More specifically, the first layer handles the low-level fusion of camera and LiDAR data for initial 3D bounding box extraction. The second layer converts the 3D bounding boxes of each LiDAR scan to the world coordinate frame and applies a spatial pairing and merging mechanism to maintain the uniqueness of objects observed from different viewpoints. Finally, the third layer iteratively supervises the consistency of the results on the global map, using a point-to-voxel comparison to identify all points in the global map that belong to the object. Benchmarking results of the proposed architecture are showcased in multiple experimental trials on a public state-of-the-art large-scale dataset of urban environments.
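The second and third layers can be pictured with the following sketch. It moves a per-scan box into the world frame, merges it with an existing object when the two overlap, and selects global-map points that share a voxel with an object's points. The abstract does not state the pairing criterion or the voxel size, so the IoU test, the `iou_threshold` and `voxel_size` defaults, and all function names here are assumptions chosen for illustration only.

```python
import numpy as np


def box_to_world(box_min, box_max, T_world_lidar):
    """Transform the 8 corners of an axis-aligned box and re-fit a box in the world frame."""
    corners = np.array([
        [x, y, z]
        for x in (box_min[0], box_max[0])
        for y in (box_min[1], box_max[1])
        for z in (box_min[2], box_max[2])
    ])
    homog = np.hstack([corners, np.ones((8, 1))])
    world = (T_world_lidar @ homog.T).T[:, :3]
    return world.min(axis=0), world.max(axis=0)


def iou_3d(a, b):
    """Intersection over union of two axis-aligned boxes given as (min_xyz, max_xyz)."""
    overlap = np.clip(np.minimum(a[1], b[1]) - np.maximum(a[0], b[0]), 0.0, None)
    inter = float(np.prod(overlap))
    union = float(np.prod(a[1] - a[0]) + np.prod(b[1] - b[0])) - inter
    return inter / union if union > 0.0 else 0.0


def merge_into_map(objects, new_box, iou_threshold=0.1):
    """Merge new_box into the first overlapping object, else append it.

    The IoU-based pairing rule is an assumed stand-in for the paper's
    spatial pairing mechanism. Returns the index of the affected object.
    """
    for i, obj in enumerate(objects):
        if iou_3d(obj, new_box) > iou_threshold:
            objects[i] = (np.minimum(obj[0], new_box[0]),
                          np.maximum(obj[1], new_box[1]))
            return i
    objects.append(new_box)
    return len(objects) - 1


def points_in_object_voxels(map_points, object_points, voxel_size=0.2):
    """Keep map points whose voxel contains at least one object point."""
    occupied = {tuple(v) for v in np.floor(object_points / voxel_size).astype(int)}
    keys = np.floor(map_points / voxel_size).astype(int)
    mask = np.array([tuple(k) in occupied for k in keys], dtype=bool)
    return map_points[mask]
```

Re-fitting an axis-aligned box after the transform keeps the overlap test simple, at the cost of a looser box when the sensor is rotated relative to the world frame.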

Paper Structure (13 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Depiction of the proposed framework BOX3D with the inputs and outputs of each of the layers. Pink points denote the detected objects, while white ones correspond to the rest of the world point cloud.
  • Figure 2: Functional block diagram of the proposed framework for lightweight object detection and localization based on camera-LiDAR fusion.
  • Figure 3: Example of the input (a) and the outputs of each step of the 3D bounding box generation module, where 2D bounding boxes (b) are mapped to 3D coordinates using the segmentation mask (c) to label the points on the projected point cloud (d).
  • Figure 4: Visualization of the 3D bounding boxes of the detected objects on the global map [B] and of different instances of missed detection and partial detection [A].
  • Figure 5: Visualization of the 3D bounding boxes of the detected objects on the global map [B] and of different instances of missed detection and partial detection [A].