VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

Yin Zhou, Oncel Tuzel

TL;DR

The paper addresses LiDAR-based 3D object detection by removing hand-crafted feature design and learning end to end from raw point clouds. It introduces VoxelNet, which partitions space into equally spaced voxels, applies voxel feature encoding (VFE) layers to learn a feature vector per voxel, aggregates spatial context with 3D convolutional middle layers, and uses a region proposal network (RPN) to predict 3D bounding boxes. On the KITTI car detection benchmark the method outperforms prior LiDAR-based 3D detectors by a large margin, and it shows encouraging results for pedestrians and cyclists using only LiDAR. An efficient implementation exploits the sparsity of the voxel grid so that only non-empty voxels are processed during feature learning. The main contribution is a single end-to-end trainable network that couples point-set feature learning with region proposals.

Abstract

Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR-based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
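
To make the voxel partition step concrete, here is a minimal NumPy sketch of grouping points into equally spaced voxels. It is an illustration under stated assumptions, not the authors' implementation: the function name `group_points_into_voxels`, its parameters, and the choice to keep the first `max_points_per_voxel` points of each voxel (rather than sampling them randomly) are all hypothetical, and the points are assumed to be cropped to the detection range already.

```python
import numpy as np

def group_points_into_voxels(points, voxel_size, range_min, max_points_per_voxel):
    """Group points into equally spaced voxels (illustrative sketch).

    points: (N, C) array whose first three columns are x, y, z.
    voxel_size, range_min: length-3 arrays (voxel edge lengths, grid origin).
    Returns (coords, buffer, counts) for the K non-empty voxels only.
    """
    # Integer voxel index of every point along each axis.
    idx = np.floor((points[:, :3] - range_min) / voxel_size).astype(np.int64)

    # Keep only non-empty voxels; `inverse` maps each point to its voxel row.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    num_voxels = coords.shape[0]
    buffer = np.zeros((num_voxels, max_points_per_voxel, points.shape[1]),
                      dtype=points.dtype)
    counts = np.zeros(num_voxels, dtype=np.int64)

    # Assumption: keep the first points seen in each voxel, drop the rest.
    for point, k in zip(points, inverse):
        if counts[k] < max_points_per_voxel:
            buffer[k, counts[k]] = point
            counts[k] += 1
    return coords, buffer, counts
```

Storing only the non-empty voxels in a dense `(K, T, C)` buffer is one way to take advantage of the sparsity of LiDAR data: the per-voxel feature computation then scales with the number of occupied voxels rather than with the full grid.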

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: VoxelNet directly operates on the raw point cloud (no need for feature engineering) and produces the 3D detection results using a single end-to-end trainable network.
  • Figure 2: VoxelNet architecture. The feature learning network takes a raw point cloud as input, partitions the space into voxels, and transforms the points within each voxel into a vector representation characterizing the shape information. The space is represented as a sparse 4D tensor. The convolutional middle layers process the 4D tensor to aggregate spatial context. Finally, an RPN generates the 3D detection.
  • Figure 3: Voxel feature encoding layer.
  • Figure 4: Region proposal network architecture.
  • Figure 5: Illustration of efficient implementation.
  • ...and 1 more figure
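
The voxel feature encoding layer named in Figure 3 can be sketched as follows: a shared point-wise linear map with ReLU, an element-wise max over the points of each voxel, and a concatenation of every point-wise feature with the voxel-level aggregate. This is a simplified NumPy sketch, not the authors' code: the function name `vfe_layer`, its argument layout, and the externally supplied weights are hypothetical, and batch normalization is omitted for brevity.

```python
import numpy as np

def vfe_layer(voxel_points, counts, weight, bias):
    """One voxel feature encoding step (illustrative sketch).

    voxel_points: (K, T, C_in) zero-padded points for K non-empty voxels.
    counts:       (K,) number of real points in each voxel (at least 1).
    weight, bias: (C_in, C_half) and (C_half,) shared point-wise parameters.
    Returns a (K, T, 2 * C_half) array of point-wise concatenated features.
    """
    num_slots = voxel_points.shape[1]
    # mask[k, t, 0] is True when slot t of voxel k holds a real point.
    mask = (np.arange(num_slots)[None, :] < counts[:, None])[..., None]

    # Shared fully connected layer with ReLU, applied to every point.
    pointwise = np.maximum(voxel_points @ weight + bias, 0.0) * mask

    # Element-wise max over the points of each voxel gives the aggregate.
    aggregated = pointwise.max(axis=1, keepdims=True)

    # Concatenate each point feature with its voxel's aggregated feature,
    # then zero the padded slots so they carry no information forward.
    tiled = np.broadcast_to(aggregated, pointwise.shape)
    return np.concatenate([pointwise, tiled], axis=2) * mask
```

Because the ReLU output is non-negative, zeroing padded slots before the max does not change the aggregate for any voxel that contains at least one real point. Taking a further max over the point axis of the last such layer's output yields a single feature vector per voxel.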