PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection

Zhaoqi Leng, Pei Sun, Tong He, Dragomir Anguelov, Mingxing Tan

TL;DR

This work tackles the information bottleneck arising from pooling in voxel-based 3D detectors by introducing PVTransformer, a Transformer-based point-to-voxel encoder. By treating points inside each voxel as tokens and aggregating them with a learnable latent/residual query, PVTransformer learns a rich voxel representation via attention, enabling scalable improvement over PointNet-based pooling. Extensive Waymo Open Dataset experiments demonstrate state-of-the-art performance (e.g., 76.5 mAPH L2 on the test set) and favorable scaling behavior compared with prior transformer- and voxel-based detectors. The findings suggest that learnable point-to-voxel aggregation substantially enhances both accuracy and scalability for 3D object detection in sparse LiDAR data.

Abstract

3D object detectors for point clouds often rely on a pooling-based PointNet to encode sparse points into grid-like voxels or pillars. In this paper, we identify that the common PointNet design introduces an information bottleneck that limits 3D object detection accuracy and scalability. To address this limitation, we propose PVTransformer: a transformer-based point-to-voxel architecture for 3D detection. Our key idea is to replace the PointNet pooling operation with an attention module, leading to a better point-to-voxel aggregation function. Our design respects the permutation invariance of sparse 3D points while being more expressive than the pooling-based PointNet. Experimental results show our PVTransformer achieves much better performance compared to the latest 3D object detectors. On the widely used Waymo Open Dataset, our PVTransformer achieves state-of-the-art 76.5 mAPH L2, outperforming the prior art of SWFormer by +1.7 mAPH L2.
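The contrast between pooling and attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `max_pool` and `attention_pool`, the toy feature values, and the use of a single attention head without learned key/value projections are all assumptions made for brevity. The sketch shows that a query attending over the points of a voxel gives the same result under any point ordering, just as max pooling does, while mixing information from every point rather than keeping one value per channel.

```python
import math
import random


def max_pool(points):
    """PointNet-style aggregation: element-wise max over the points of one voxel."""
    return [max(channel) for channel in zip(*points)]


def attention_pool(points, query):
    """Single-head cross-attention with one query vector (hypothetical helper).

    Keys and values are the raw point features; a real Transformer layer
    would apply learned linear projections first (omitted here).
    """
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, p)) / math.sqrt(dim) for p in points]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * p[c] for w, p in zip(weights, points)) for c in range(dim)]


# Three toy point features (4 channels each) that fall into the same voxel.
points = [[0.2, 1.0, -0.5, 0.3], [0.9, -0.4, 0.1, 0.0], [-0.3, 0.5, 0.8, 0.7]]
query = [0.1, -0.2, 0.3, 0.05]  # stand-in for a learned query vector

shuffled = points[:]
random.Random(0).shuffle(shuffled)

pooled = max_pool(points)
attended = attention_pool(points, query)
attended_shuffled = attention_pool(shuffled, query)
```

Both aggregations are permutation invariant; the difference is that the attention output is a weighted combination of all points in the voxel, with weights that depend on a trainable query.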

Paper Structure (15 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: PVTransformer (PVT) as a scalable architecture. PVTransformer addresses the pooling bottleneck in previous voxel-based 3D detectors and demonstrates better scalability compared to scaling PointNet (Scale Point) and voxel architectures (Scale Voxel). The size of each point represents the model Flops. More details are shown in Figure 4 and Figure 5.
  • Figure 2: Overview of the PVTransformer architecture. The PVTransformer architecture contains a point architecture and a voxel architecture. Its novelty lies in the point architecture, which substitutes PointNet with a novel Transformer design. In the point architecture, points are bucketed into pillars, and each point is treated as a token. Within a voxel, points pass through a self-attention Transformer followed by a cross-attention Transformer that aggregates point features into voxel features; details are shown in Figure 3 (b). The sparse BEV voxel features then proceed to the voxel architecture, which employs a multi-scale sparse window Transformer (SWFormer Block) \cite{sun2022swformer} for encoding and CenterNet heads \cite{yin2021centerpoint} for bounding box prediction.
  • Figure 3: Point-to-voxel aggregation in PVTransformer. This module replaces PointNet's max pooling \cite{qi2017pointnet} with a Transformer layer. (a) The vanilla max pooling layer aggregates point features by selecting the element-wise max feature $(A_i, B_j, C_k, D_l)$ from three point features to form the voxel features, where $i, j, k, l \in \{1, 2, 3\}$. (b) Our proposed residual query uses the sum of the max pooled feature and a learnable latent vector $(\overline{A_i}, \overline{B_j}, \overline{C_k}, \overline{D_l})$, i.e. $(A_i + \overline{A_i}, B_j + \overline{B_j}, C_k + \overline{C_k}, D_l + \overline{D_l})$, to query and aggregate point features.
  • Figure 4: PVTransformer: better scalability. Increasing PointNet's (PN) depth (red, purple) and channels (yellow) yields modest performance improvements, while scaling PVTransformer (PVT, green) shows significant improvements. Previous works, both single-scale (SS) \cite{singlestride21} and multi-scale (MS) \cite{sun2022swformer} architectures, use PointNet for point feature aggregation, which underperforms when scaled beyond certain thresholds because of overfitting. PVTransformer (green) overcomes these limitations by incorporating a Transformer-based point-to-voxel encoder, enabling effective scaling beyond 300 GFlops and achieving 74.0 mAPH L2 for vehicle and pedestrian detection on the Waymo Open Dataset validation set.
  • Figure 5: The voxel architecture has limited scalability when using PointNet (PN) to aggregate point features. Right: Using a Transformer to aggregate point features (PVT_L, green) is significantly better than using PointNet and scaling only the channels in the voxel architecture to 256 (blue), a 3.5 mAPH L2 increase at similar Flops. Left: Performance of randomly sampled voxel architectures from the search space (in \ref{tab:bb_search_space}) after training for 12.8 epochs. We observe that scaling the voxel architecture while using PointNet can lead to suboptimal performance. The Pareto curve (red) shows that scaling the voxel architecture from 128 to 192 and 256 channels leads to overfitting. Waymo Open Dataset validation set mAPH L2 on Vehicle and Pedestrian is reported.
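
The residual query of Figure 3 (b) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's code: `residual_query_aggregate` is a hypothetical name, the attention uses a single head with raw point features as keys and values (no learned projections, normalization, or feed-forward layers), and the latent vector is a fixed list standing in for a trained parameter.

```python
import math


def max_pool(points):
    """Element-wise max over the point features of one voxel (Figure 3 (a))."""
    return [max(channel) for channel in zip(*points)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_query_aggregate(points, latent):
    """Query = max-pooled feature + latent vector, then cross-attend over points.

    Hypothetical simplification of Figure 3 (b): keys and values are the raw
    point features rather than learned projections of them.
    """
    query = [m + l for m, l in zip(max_pool(points), latent)]
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, p)) / math.sqrt(dim) for p in points]
    weights = softmax(scores)
    voxel = [sum(w * p[c] for w, p in zip(weights, points)) for c in range(dim)]
    return query, weights, voxel


# Three point features with four channels each, as in the Figure 3 illustration.
points = [[0.2, 1.0, -0.5, 0.3], [0.9, -0.4, 0.1, 0.0], [-0.3, 0.5, 0.8, 0.7]]
latent = [0.1, -0.1, 0.2, 0.0]  # stand-in for the learnable latent vector

query, weights, voxel = residual_query_aggregate(points, latent)
zero_query, _, _ = residual_query_aggregate(points, [0.0] * 4)
```

With a zero latent vector the query reduces to the max-pooled feature, so in this sketch the latent term acts as a learned offset on top of the pooling result.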