Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection

Yuhao Huang, Sanping Zhou, Junjie Zhang, Jinpeng Dong, Nanning Zheng

TL;DR

This paper develops a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and introduces the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features.

Abstract

Efficient representation of point clouds is fundamental for LiDAR-based 3D object detection. While recent grid-based detectors often encode point clouds into either voxels or pillars, the distinctions between these approaches remain underexplored. In this paper, we quantify the differences between the current encoding paradigms and highlight the limited vertical learning within. To tackle these limitations, we introduce a hybrid Voxel-Pillar Fusion network (VPF), which synergistically combines the unique strengths of both voxels and pillars. Specifically, we first develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and then introduce the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features. Our efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors. Leveraging this powerful yet straightforward framework, VPF delivers competitive performance, achieving real-time inference speeds on the nuScenes and Waymo Open Dataset. The code will be available.
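
The bidirectional interaction described above can be illustrated with a small NumPy sketch. It is not the paper's Sparse Fusion Layer: the function name `fuse_voxel_pillar`, the max-pooling along the vertical axis, and the additive fusion are all assumptions chosen for illustration. It also assumes that every active voxel has a matching active pillar at the same BEV cell and that features are floating point.

```python
import numpy as np

def fuse_voxel_pillar(voxel_coords, voxel_feats, pillar_coords, pillar_feats):
    """Hypothetical bidirectional voxel-pillar fusion (illustrative only).

    voxel_coords:  (N, 3) integer array of (z, y, x) indices of active voxels.
    voxel_feats:   (N, C) float array of voxel features.
    pillar_coords: (M, 2) integer array of (y, x) indices of active pillars.
    pillar_feats:  (M, C) float array of pillar features.
    """
    # Map each pillar's (y, x) cell to its row index.
    pillar_index = {tuple(c): i for i, c in enumerate(pillar_coords.tolist())}
    # For every voxel, find the pillar that shares its (y, x) cell.
    v2p = np.array([pillar_index[(y, x)] for _, y, x in voxel_coords.tolist()])

    # Voxel -> pillar: max-pool voxel features along z within each pillar.
    pooled = np.full_like(pillar_feats, -np.inf)
    np.maximum.at(pooled, v2p, voxel_feats)
    pooled[np.isinf(pooled)] = 0.0  # pillars without any voxel contribute nothing

    # Pillar -> voxel: broadcast each pillar feature to the voxels it contains.
    broadcast = pillar_feats[v2p]

    # Additive fusion in both directions (an assumption, not the paper's operator).
    return voxel_feats + broadcast, pillar_feats + pooled
```

Because both directions reuse the same voxel-to-pillar index, the sketch touches only active sites and never builds a dense grid, which is the property that makes a fully sparse design attractive.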

Paper Structure (39 sections, 7 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Recall vs. Vertical Density comparison. For both dense and sparse detectors [yin2021center, shi2022pillarnet, chen2023voxelnext], pillar-based representations show enhanced recall under low vertical densities, while voxel-based representations tend to excel in high-density scenarios. Notably, our hybrid representation offers consistent improvements across different situations.
  • Figure 2: The framework of VPF. Point clouds are first processed by the sparse voxel-pillar encoder, which extracts correlated sparse voxel and pillar features. The subsequent Sparse Fusion Layer facilitates bidirectional interaction, capturing supplementary information from both types of sparse features. Together, these components form a hybrid backbone capable of integrating with both dense and sparse detectors.
  • Figure 3: Consistent voxel-pillar downsampling process. In the downsampling procedure, by equalizing the kernel sizes, strides, and padding operations of 2D and 3D regular sparse convolutions in X-Y dimensions, the consistent BEV occupancy is preserved for sparse voxel and pillar features.
  • Figure 4: Performance vs. inference latency on the Waymo Open Dataset (WOD) val set. Measured on a single RTX 3090 GPU with batch size 1.
  • Figure 5: Ablation on deploying steps for SFL.
  • ...and 3 more figures
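
The consistency property in the Figure 3 caption can be checked with a short pure-Python sketch. It emulates only the active-site rule of a regular (non-submanifold) sparse convolution, where an output site becomes active if any active input falls inside its kernel window. The helper name `downsample_active` and the kernel 3, stride 2, padding 1 setting are assumptions for illustration, not values confirmed by the source.

```python
import itertools

def downsample_active(coords, shape, kernel=3, stride=2, padding=1):
    """Active output sites of a regular sparse convolution (hypothetical helper).

    coords: iterable of integer index tuples for the active input sites.
    shape:  grid size along each dimension, same length as each tuple.
    """
    out_shape = tuple((n + 2 * padding - kernel) // stride + 1 for n in shape)
    active = set()
    for c in coords:
        ranges = []
        for d, size in enumerate(out_shape):
            # Output o covers inputs o*stride - padding ... o*stride - padding + kernel - 1.
            lo = max(0, -(-(c[d] + padding - kernel + 1) // stride))  # ceiling division
            hi = min(size - 1, (c[d] + padding) // stride)
            ranges.append(range(lo, hi + 1))
        active.update(itertools.product(*ranges))
    return active

# Active voxels as (z, y, x) on a 4 x 8 x 8 grid, and the pillars they project to.
voxels = {(0, 0, 0), (3, 2, 5), (1, 7, 7)}
pillars = {(y, x) for _, y, x in voxels}

out_voxels = downsample_active(voxels, shape=(4, 8, 8))
out_pillars = downsample_active(pillars, shape=(8, 8))

# Projecting the downsampled voxels onto BEV recovers the downsampled pillars.
bev_from_voxels = {(y, x) for _, y, x in out_voxels}
print(bev_from_voxels == out_pillars)
```

With identical X-Y settings, the 3D and 2D branches activate the same BEV cells after downsampling, so voxel and pillar features stay aligned cell by cell. If the X-Y kernel, stride, or padding differed between the two branches, the two active sets could diverge.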