PointPillars: Fast Encoders for Object Detection from Point Clouds

Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, Oscar Beijbom

TL;DR

PointPillars addresses real-time 3D object detection from lidar by learning a pillar-based encoder that converts sparse point clouds into a dense 2D pseudo-image for a 2D CNN backbone. It combines a Pillar Feature Network with a lightweight backbone and an SSD-style head to predict oriented 3D boxes, achieving state-of-the-art results on KITTI while running in real time (62 Hz, with a faster 105 Hz variant). It outperforms both fixed encoders and slower voxel-based methods in speed and accuracy, and ablations show how pillar size, decorations, and data augmentation affect performance. The result is a practical, end-to-end learnable encoding for fast, accurate lidar-based 3D detection suitable for autonomous systems; the code is publicly available.

Abstract

Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders: fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2-4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.

Paper Structure

This paper contains 32 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Bird's eye view performance vs. speed for our proposed PointPillars (PP) method on the KITTI [kitti] test set. Lidar-only methods are drawn as blue circles; lidar & vision methods are drawn as red squares. Also drawn are top methods from the KITTI leaderboard: M: MV3D [mv3d], A: AVOD [avod], C: ContFuse [contfuse], V: VoxelNet [voxelnet], F: Frustum PointNet [frustum], S: SECOND [second], P+: PIXOR++ [hdnet]. PointPillars outperforms all other lidar-only methods in terms of both speed and accuracy by a large margin. It also outperforms all fusion-based methods except on pedestrians. Similar performance is achieved on the 3D metric (see the 3D results table).
  • Figure 2: Network overview. The main components of the network are a Pillar Feature Network, Backbone, and SSD Detection Head. See the network section for more details. The raw point cloud is converted to a stacked pillar tensor and a pillar index tensor. The encoder uses the stacked pillars to learn a set of features that can be scattered back to a 2D pseudo-image for a convolutional neural network. The features from the backbone are used by the detection head to predict 3D bounding boxes for objects. Note: here we show the backbone dimensions for the car network.
  • Figure 3: Qualitative analysis of KITTI results. We show a bird's-eye view of the lidar point cloud (top), as well as the 3D bounding boxes projected into the image for clearer visualization. Note that our method only uses lidar. We show predicted boxes for car (orange), cyclist (red) and pedestrian (blue). Ground truth boxes are shown in gray. The orientation of each box is shown by a line connecting the bottom center to the front of the box.
  • Figure 4: Failure cases on KITTI. Same visualization setup as Figure 3, but focusing on several common failure modes.
  • Figure 5: BEV detection performance (mAP) vs. speed (Hz) on the KITTI [kitti] val set across pedestrians, bicycles and cars. Blue circles indicate lidar-only methods; red squares indicate methods that use lidar & vision. Different operating points were achieved by using pillar grid sizes in $\{0.12^2, 0.16^2, 0.2^2, 0.24^2, 0.28^2\}\,\mathrm{m}^2$. The number of max-pillars was varied along with the resolution and set to $16000, 12000, 12000, 8000, 8000$ respectively.
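
The data flow described in the Figure 2 caption (raw points, then a stacked pillar tensor plus a pillar index tensor, then features scattered back to a 2D pseudo-image) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the `max_points` cap, the truncation of overfull pillars, and the toy grid are all assumptions made for illustration, and `encode_stand_in` uses a fixed per-channel max in place of the learned PointNet-style Pillar Feature Network.

```python
from collections import defaultdict


def pillarize(points, pillar_size, x_min, y_min, nx, ny, max_pillars, max_points):
    """Group (x, y, z, reflectance) points into vertical pillars on an x-y grid.

    Returns (stacked, index): stacked[p] holds up to max_points points of
    pillar p, and index[p] is that pillar's (ix, iy) grid cell.
    Overfull pillars are simply truncated here (a simplification).
    """
    cells = defaultdict(list)
    for x, y, z, r in points:
        ix = int((x - x_min) // pillar_size)
        iy = int((y - y_min) // pillar_size)
        if 0 <= ix < nx and 0 <= iy < ny:  # drop points outside the grid
            cells[(ix, iy)].append((x, y, z, r))
    index = list(cells)[:max_pillars]  # keep at most max_pillars non-empty cells
    stacked = [cells[cell][:max_points] for cell in index]
    return stacked, index


def encode_stand_in(stacked):
    """Hypothetical stand-in for the learned encoder: per-channel max over
    the points in each pillar, giving one 4-dim feature vector per pillar."""
    return [[max(p[ch] for p in pts) for ch in range(4)] for pts in stacked]


def scatter(features, index, nx, ny):
    """Write each pillar's feature vector back to its grid cell.

    Returns a channels x ny x nx nested list; empty cells stay zero.
    """
    channels = len(features[0]) if features else 0
    image = [[[0.0] * nx for _ in range(ny)] for _ in range(channels)]
    for feat, (ix, iy) in zip(features, index):
        for ch, value in enumerate(feat):
            image[ch][iy][ix] = value
    return image


# Toy example: a 4 x 2 grid of 1 m pillars; the last point falls outside it.
points = [
    (0.5, 0.5, 0.0, 0.1),
    (0.7, 0.2, 1.0, 0.3),
    (2.5, 1.5, 0.2, 0.9),
    (9.0, 9.0, 0.0, 0.0),
]
stacked, index = pillarize(points, pillar_size=1.0, x_min=0.0, y_min=0.0,
                           nx=4, ny=2, max_pillars=8000, max_points=100)
features = encode_stand_in(stacked)
image = scatter(features, index, nx=4, ny=2)
```

Only non-empty pillars are stored and encoded, so the encoder's work scales with the number of occupied pillars rather than the full grid. The pillar size and the max-pillars cap are the two knobs varied in Figure 5 to trade accuracy for speed.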