Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection

Yuhao Huang; Sanping Zhou; Junjie Zhang; Jinpeng Dong; Nanning Zheng

TL;DR

This paper develops a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and introduces the Sparse Fusion Layer (SFL), facilitating bidirectional interaction between sparse voxel and pillar features.

Abstract

Efficient representation of point clouds is fundamental for LiDAR-based 3D object detection. While recent grid-based detectors often encode point clouds into either voxels or pillars, the distinctions between these approaches remain underexplored. In this paper, we quantify the differences between the current encoding paradigms and highlight the limited vertical learning in each. To tackle these limitations, we introduce a hybrid Voxel-Pillar Fusion network (VPF), which combines the complementary strengths of voxels and pillars. Specifically, we first develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively, and then introduce the Sparse Fusion Layer (SFL), which facilitates bidirectional interaction between sparse voxel and pillar features. Our efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors. With this powerful yet straightforward framework, VPF delivers competitive performance at real-time inference speeds on the nuScenes and Waymo Open Dataset benchmarks. The code will be available.
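To make the bidirectional interaction concrete, the following is a minimal sketch of pooling sparse voxel features into pillars and broadcasting pillar features back to voxels. It is not the authors' implementation: the dictionary-based sparse layout, the function names (`sparse_pool`, `sparse_broadcast`, `fuse`), the choice of max pooling along the vertical axis, and the fusion by element-wise addition are all assumptions made for illustration. It also assumes that every occupied pillar has at least one occupied voxel in its column.

```python
def sparse_pool(voxels):
    """Max-pool sparse voxel features along z into per-pillar features.

    voxels: dict mapping (z, y, x) -> list of floats (hypothetical layout).
    Returns a dict mapping (y, x) -> list of floats.
    """
    pooled = {}
    for (z, y, x), feat in voxels.items():
        key = (y, x)
        if key in pooled:
            pooled[key] = [max(a, b) for a, b in zip(pooled[key], feat)]
        else:
            pooled[key] = list(feat)
    return pooled


def sparse_broadcast(pillars, voxel_coords):
    """Copy each pillar feature to every occupied voxel in its column."""
    return {(z, y, x): list(pillars[(y, x)]) for (z, y, x) in voxel_coords}


def fuse(voxels, pillars):
    """Assumed fusion rule: add the pooled / broadcast features element-wise."""
    pooled = sparse_pool(voxels)
    broadcast = sparse_broadcast(pillars, voxels.keys())
    new_pillars = {k: [a + b for a, b in zip(pillars[k], pooled[k])]
                   for k in pillars}
    new_voxels = {k: [a + b for a, b in zip(voxels[k], broadcast[k])]
                  for k in voxels}
    return new_voxels, new_pillars


# Two voxels share the column (y=1, x=1); one voxel sits in column (0, 2).
voxels = {(0, 1, 1): [1.0, 2.0], (2, 1, 1): [3.0, 0.0], (1, 0, 2): [5.0, 5.0]}
pillars = {(1, 1): [10.0, 10.0], (0, 2): [20.0, 20.0]}
new_voxels, new_pillars = fuse(voxels, pillars)
```

Only occupied sites are visited in both directions, so the sketch stays fully sparse: no dense grid is ever materialized.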

Paper Structure (39 sections, 7 equations, 8 figures, 8 tables)

Introduction
Related Work
Grid-based 3D Object Detection
Multi-Source Feature Fusion
Methodology
Sparse Voxel-Pillar Encoder
Consistent Voxel-Pillar Encoding.
Sparse Conv Block.
Sparse Fusion Layer
Sparse Pooling and Broadcasting.
Sparse Voxel-Pillar Fusion.
Detection Framework
VPF$_\mathrm{de}$.
VPF$_\mathrm{sp}$.
Training Loss.
...and 24 more sections

Figures (8)

Figure 1: Recall vs. Vertical Density comparison. For both dense and sparse detectors [yin2021center, shi2022pillarnet, chen2023voxelnext], pillar-based representations show enhanced recall under low vertical densities, while voxel-based representations tend to excel in high-density scenarios. Notably, our hybrid representation offers consistent improvements across different situations.
Figure 2: The framework of VPF. Point clouds are first processed by the sparse voxel-pillar encoder, which extracts correlated sparse voxel and pillar features. The subsequent Sparse Fusion Layer facilitates bidirectional interaction, capturing supplementary information from both types of sparse features. Together, these components form a hybrid backbone capable of integrating with both dense and sparse detectors.
Figure 3: Consistent voxel-pillar downsampling process. In the downsampling procedure, by equalizing the kernel sizes, strides, and padding operations of 2D and 3D regular sparse convolutions in X-Y dimensions, the consistent BEV occupancy is preserved for sparse voxel and pillar features.
Figure 4: Performance vs. inference latency on WOD val set. Tested on a single 3090 GPU with batch size $1$.
Figure 5: Ablation on deploying steps for SFL.
...and 3 more figures
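
The consistency property described in the Figure 3 caption can be checked with a small occupancy-only sketch. This is an illustration under stated assumptions, not the paper's code: the helper names (`conv_out_coords`, `bev`), the 8x8x8 grid, and the kernel 3 / stride 2 / padding 1 setting are hypothetical, and a regular sparse convolution is modeled simply as "an output site is active if any active input lies in its receptive field".

```python
from itertools import product


def conv_out_coords(active, shape, kernel, stride, padding):
    """Active output sites of a regular (non-submanifold) sparse convolution.

    active: set of integer coordinate tuples; shape, kernel, stride and
    padding are tuples with one entry per axis.
    """
    out = set()
    for coord in active:
        per_axis = []
        for c, n, k, s, p in zip(coord, shape, kernel, stride, padding):
            n_out = (n + 2 * p - k) // s + 1  # output size along this axis
            hits = []
            for off in range(k):
                q = c + p - off  # padded start of a window containing c
                if q >= 0 and q % s == 0 and q // s < n_out:
                    hits.append(q // s)
            per_axis.append(hits)
        out.update(product(*per_axis))
    return out


def bev(coords):
    """Project (z, y, x) coordinates onto the X-Y plane."""
    return {(y, x) for (_, y, x) in coords}


voxels = {(0, 0, 0), (5, 3, 7), (7, 7, 7), (2, 4, 1)}
pillars = bev(voxels)  # pillars start with the same BEV occupancy

# Same kernel, stride and padding in X-Y for the 3D and the 2D convolution.
out3d = conv_out_coords(voxels, (8, 8, 8), (3, 3, 3), (2, 2, 2), (1, 1, 1))
out2d = conv_out_coords(pillars, (8, 8), (3, 3), (2, 2), (1, 1))
consistent = bev(out3d) == out2d

# Changing only the 2D padding breaks the correspondence.
out2d_bad = conv_out_coords(pillars, (8, 8), (3, 3), (2, 2), (0, 0))
mismatch = bev(out3d) != out2d_bad
```

With matched X-Y settings, every downsampled voxel still has a pillar beneath it and vice versa, which is what allows features to be exchanged between the two branches without any re-indexing.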
