CVCP-Fusion: On Implicit Depth Estimation for 3D Bounding Box Prediction

Pranav Gupta, Rishabh Rengarajan, Viren Bankapur, Vedansh Mannem, Lakshit Ahuja, Surya Vijay, Kevin Wang

TL;DR

We find that while an implicitly calculated depth estimate may be sufficiently accurate in a 2D map-view representation, explicitly calculated geometric and spatial information is needed for precise bounding box prediction in the 3D world-view space.

Abstract

Combining LiDAR and camera-view data has become a common approach for 3D object detection. However, previous approaches combine the two input streams at the point level, throwing away semantic information derived from camera features. In this paper we propose Cross-View Center Point-Fusion (CVCP-Fusion), a state-of-the-art model that performs 3D object detection by combining camera- and LiDAR-derived features in the BEV space, preserving semantic density from the camera stream while incorporating spatial data from the LiDAR stream. Our architecture draws on two previously established algorithms, Cross-View Transformers and CenterPoint, and runs their backbones in parallel, allowing efficient computation for real-time processing and application. We find that while an implicitly calculated depth estimate may be sufficiently accurate in a 2D map-view representation, explicitly calculated geometric and spatial information is needed for precise bounding box prediction in the 3D world-view space.
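
The gap between map-view accuracy and 3D accuracy can be illustrated with a small sketch. It is not taken from the paper: it assumes axis-aligned boxes (zero yaw) in a hypothetical (cx, cy, cz, length, width, height) layout, and the box values are invented for illustration. It compares the overlap of two boxes in the 2D map view with their overlap in 3D when the prediction is offset only along z.

```python
def overlap_1d(c1, s1, c2, s2):
    """Length of the overlap between two intervals given by center and size."""
    lo = max(c1 - s1 / 2, c2 - s2 / 2)
    hi = min(c1 + s1 / 2, c2 + s2 / 2)
    return max(0.0, hi - lo)


def bev_iou(a, b):
    """IoU of two axis-aligned boxes projected onto the x-y (map-view) plane."""
    inter = overlap_1d(a[0], a[3], b[0], b[3]) * overlap_1d(a[1], a[4], b[1], b[4])
    union = a[3] * a[4] + b[3] * b[4] - inter
    return inter / union


def iou_3d(a, b):
    """IoU of two axis-aligned boxes in full 3D."""
    inter = (
        overlap_1d(a[0], a[3], b[0], b[3])
        * overlap_1d(a[1], a[4], b[1], b[4])
        * overlap_1d(a[2], a[5], b[2], b[5])
    )
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union


# Hypothetical boxes: (cx, cy, cz, length, width, height), all in metres.
ground_truth = (10.0, 5.0, 0.0, 4.5, 1.9, 1.6)
prediction = (10.0, 5.0, 1.0, 4.5, 1.9, 1.6)  # same x-y footprint, 1 m too high

print(f"map-view IoU: {bev_iou(ground_truth, prediction):.3f}")
print(f"3D IoU:       {iou_3d(ground_truth, prediction):.3f}")
```

With these made-up values the map-view overlap is perfect, while the 3D overlap drops to roughly 0.23, because the vertical error is invisible after projection onto the x-y plane.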

Paper Structure

This paper contains 15 sections, 2 figures.

Figures (2)

  • Figure 1: An overview of our proposed model architecture. For each image, we extract image features across multiple scales. Using known camera pose and intrinsics, we construct a camera-aware positional embedding. We learn a map-view positional embedding that aggregates information from all views through a series of cross-attention layers. This embedding is passed through a decoder, which converts the learned embeddings into the 3D BEV space. In parallel, the corresponding LiDAR data is passed through a PointPillars network and upscaled to the 3D BEV space using an MLP. The result is concatenated with the camera-derived BEV and convolved. A 3D CNN-based detection head then finds object centers and regresses to full 3D bounding boxes using center features. This box prediction is used to extract point features at the 3D centers of each face of the estimated 3D bounding box; these features are passed into an MLP to predict an IoU-guided confidence score and a box regression refinement, allowing rotation of predicted 3D bounding boxes.
  • Figure 2: Perspective, front, and top-down views, respectively. Three cars and their corresponding predictions sampled from the nuScenes dataset (ground truth in blue, predictions in red) illustrate CVCP-Fusion predicting rotation and bounding box coordinates well in the x- and y-dimensions but failing to predict accurately in the z-dimension.
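
The refinement stage in the Figure 1 caption samples features at the 3D centers of each face of a predicted box. As a rough sketch of where those sample points would lie, the function below computes the six face centers of a box rotated by a yaw angle about the z-axis. The function name, the argument order, and the convention that length runs along the box's local x-axis are assumptions made for illustration, not details confirmed by the paper.

```python
import math


def face_centers(cx, cy, cz, length, width, height, yaw):
    """Return the six face centers of a 3D box rotated by `yaw` about the z-axis.

    Assumed convention: length along local x, width along local y,
    height along z, yaw in radians measured counter-clockwise.
    """
    # Offsets from the box center to each face center in the box's local frame.
    local_offsets = [
        (length / 2, 0.0, 0.0),    # front
        (-length / 2, 0.0, 0.0),   # back
        (0.0, width / 2, 0.0),     # left
        (0.0, -width / 2, 0.0),    # right
        (0.0, 0.0, height / 2),    # top
        (0.0, 0.0, -height / 2),   # bottom
    ]
    cos_yaw, sin_yaw = math.cos(yaw), math.sin(yaw)
    centers = []
    for dx, dy, dz in local_offsets:
        # Rotate the x-y offset by yaw, then translate to the box center.
        x = cx + cos_yaw * dx - sin_yaw * dy
        y = cy + sin_yaw * dx + cos_yaw * dy
        centers.append((x, y, cz + dz))
    return centers


# Hypothetical box: centered at the origin, 4 m x 2 m x 1.5 m, rotated 90 degrees.
for point in face_centers(0.0, 0.0, 0.0, 4.0, 2.0, 1.5, math.pi / 2):
    print(tuple(round(v, 3) for v in point))
```

Only the top and bottom face centers depend on the box height, so they are the sample points that differ from the box center along z alone.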