Frustum PointNets for 3D Object Detection from RGB-D Data

Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, Leonidas J. Guibas

TL;DR

Frustum PointNets address 3D object detection from RGB-D data by extruding 2D detections into 3D frustums, applying PointNet-based 3D instance segmentation within each frustum, and then estimating an amodal 3D box. A T-Net translates the segmented object points so that their centroid lies close to the amodal box center, and a corner loss regularizes the joint optimization of center, size, and heading; all components are trained with a multi-task loss. The approach achieves state-of-the-art results on the KITTI and SUN RGB-D benchmarks, runs in real time, and is robust to occlusion and sparse data. By combining 2D proposals with PointNet processing in 3D space, the pipeline preserves geometric structure and applies to both outdoor and indoor scenes, with relevance to autonomous driving and robotics.
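The corner loss mentioned above ties center, size, and heading together because all three determine where the eight box corners land. The following is a minimal single-box sketch of that idea, not the paper's full formulation. The box convention (size given as length, width, height along local x, y, z, heading as a rotation about a vertical z axis) and the function names `box_corners` and `corner_distance` are assumptions made for illustration. Taking the minimum over a heading-flipped ground-truth box is likewise an assumption here, intended to avoid penalizing a 180° heading ambiguity.

```python
import numpy as np

def box_corners(center, size, heading):
    """Return the 8 corners (8 x 3) of an oriented box.

    Assumed convention (illustrative only): size = (length, width, height)
    along the box's local x, y, z axes; heading is a rotation about z.
    """
    # Sign pattern of the 8 corners in the box's own frame.
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    local = signs * np.asarray(size, dtype=float) / 2.0
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, -s, 0.0],
                    [s, c, 0.0],
                    [0.0, 0.0, 1.0]])
    # Rotate each corner, then translate to the box center.
    return local @ rot.T + np.asarray(center, dtype=float)

def corner_distance(pred, gt):
    """Sum of distances between corresponding corners of two boxes.

    Each box is a (center, size, heading) tuple. The ground truth is also
    compared with its heading flipped by pi, and the smaller sum is returned.
    """
    gt_center, gt_size, gt_heading = gt
    p = box_corners(*pred)
    direct = np.linalg.norm(p - box_corners(gt_center, gt_size, gt_heading), axis=1).sum()
    flipped = np.linalg.norm(p - box_corners(gt_center, gt_size, gt_heading + np.pi), axis=1).sum()
    return min(direct, flipped)
```

An error in any one of center, size, or heading moves the corners, so a single scalar of this kind penalizes all three parameters jointly rather than through separate, unrelated terms.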

Abstract

In this work, we study 3D object detection from RGB-D data in both indoor and outdoor scenes. While previous methods focus on images or 3D voxels, often obscuring natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. However, a key challenge of this approach is how to efficiently localize objects in point clouds of large-scale scenes (region proposal). Instead of solely relying on 3D proposals, our method leverages both mature 2D object detectors and advanced 3D deep learning for object localization, achieving efficiency as well as high recall for even small objects. Benefiting from learning directly in raw point clouds, our method is also able to precisely estimate 3D bounding boxes even under strong occlusion or with very sparse points. Evaluated on KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins while having real-time capability.

Paper Structure

This paper contains 42 sections, 3 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: 3D object detection pipeline. Given RGB-D data, we first generate 2D object region proposals in the RGB image using a CNN. Each 2D region is then extruded to a 3D viewing frustum in which we get a point cloud from depth data. Finally, our frustum PointNet predicts an (oriented and amodal) 3D bounding box for the object from the points in the frustum.
  • Figure 2: Frustum PointNets for 3D object detection. We first leverage a 2D CNN object detector to propose 2D regions and classify their content. 2D regions are then lifted to 3D and thus become frustum proposals. Given a point cloud in a frustum ($n \times c$ with $n$ points and $c$ channels of XYZ, intensity, etc. for each point), the object instance is segmented by binary classification of each point. Based on the segmented object point cloud ($m \times c$), a light-weight regression PointNet (T-Net) tries to align points by translation such that their centroid is close to the amodal box center. Finally, the box estimation net estimates the amodal 3D bounding box for the object. The coordinate systems involved and the network inputs and outputs are illustrated in Fig. 4 and Fig. 5.
  • Figure 3: Challenges for 3D detection in frustum point cloud. Left: RGB image with an image region proposal for a person. Right: bird's eye view of the LiDAR points in the frustum extruded from the 2D box, where we see a wide spread of points with both foreground occluder (bikes) and background clutter (building).
  • Figure 4: Coordinate systems for point cloud. Artificial points (black dots) are shown to illustrate (a) default camera coordinate; (b) frustum coordinate after rotating frustums to center view (frustum proposal section); (c) mask coordinate with object points' centroid at origin (instance segmentation section); (d) object coordinate predicted by T-Net (box estimation section).
  • Figure 5: Basic architectures and IO for PointNets. Architecture is illustrated for PointNet++ (v2) models (Qi et al., 2017) with set abstraction layers and feature propagation layers (for segmentation). Coordinate systems involved are visualized in Fig. 4.
  • ...and 7 more figures
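
The geometric steps described in the captions for Figures 1, 2, and 4 (extruding a 2D box into a frustum, rotating the frustum to a center view, and moving the segmented points' centroid to the origin) can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a pinhole camera with intrinsic matrix `K`, points already in the camera frame with z pointing forward, and a rotation about the camera's y axis only. The function names, the intrinsics, the sample points, and the all-foreground mask are hypothetical; in the actual pipeline the mask would come from the instance segmentation PointNet.

```python
import numpy as np

def frustum_points(points, K, box2d):
    """Keep points (n x 3, camera frame) whose projection lies inside box2d."""
    xmin, ymin, xmax, ymax = box2d
    front = points[points[:, 2] > 0]          # drop points behind the camera
    uvw = front @ K.T                          # pinhole projection
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    inside = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return front[inside]

def rotate_to_center_view(points, K, box2d):
    """Rotate about the y axis so the ray through the box center maps to +z."""
    u_center = (box2d[0] + box2d[2]) / 2.0
    angle = np.arctan2(u_center - K[0, 2], K[0, 0])
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0.0, -s],
                    [0.0, 1.0, 0.0],
                    [s, 0.0, c]])
    return points @ rot.T

def to_mask_coordinates(points, fg_mask):
    """Keep foreground points and translate their centroid to the origin."""
    obj = points[fg_mask]
    return obj - obj.mean(axis=0)

# Hypothetical intrinsics, points, and 2D box, for illustration only.
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[2.0, 0.0, 10.0],
                [2.2, 0.1, 10.5],
                [-5.0, 0.0, 10.0],
                [2.0, 0.0, -3.0]])
box = (700.0, 150.0, 780.0, 220.0)

frustum = frustum_points(pts, K, box)
centered = rotate_to_center_view(frustum, K, box)
fg_mask = np.ones(len(centered), dtype=bool)   # placeholder for a predicted mask
obj = to_mask_coordinates(centered, fg_mask)
```

Normalizing in this order means each later stage sees points in a more canonical frame: the frustum's viewing direction is fixed first, then the object's position within it is removed.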