Point Virtual Transformer
Veerain Sood, Bnalin, Gaurav Pandey
TL;DR
PointViT introduces a transformer-based 3D object detector that jointly reasons over raw LiDAR points and selectively sampled virtual points to address far-field sparsity. The method explores multiple fusion strategies, leverages a heatmap-driven proposal scheme, and uses a sparse 3D backbone to generate compact queries refined by cross-attention over voxel and point tokens. Key contributions include a vote-guided sampling pipeline, a dense alignment step for lifted proto-centers, and a context-aggregating transformer head that fuses multi-modal features with robust geometric supervision. Experiments on KITTI show strong 2D and competitive 3D/BEV performance, with notable gains in easy cases and insights into the trade-offs arising from depth-completion quality and fusion timing. Overall, PointViT demonstrates that selective virtual-point fusion, when integrated with heatmap-guided proposals and efficient attention, can enhance long-range perception while controlling computational cost.
Abstract
LiDAR-based 3D object detectors often struggle to detect far-field objects because point clouds become sparse at long range, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points increases computational cost and makes it harder to fuse real and virtual information effectively. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. On the KITTI benchmark, PointViT reports 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.
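The query-initialisation step described above, in which a compact set of high-confidence queries is drawn from the BEV representation, can be illustrated with a minimal sketch. It assumes the detector produces a 2D BEV heatmap of per-cell objectness scores and that queries are seeded at the top-K local maxima of that heatmap; the function name `select_queries`, the 3x3 neighbourhood, and the `threshold` parameter are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def select_queries(
    heatmap: List[List[float]],
    k: int,
    threshold: float = 0.1,  # hypothetical score cut-off, not from the paper
) -> List[Tuple[int, int, float]]:
    """Return up to k (row, col, score) BEV cells to seed object queries.

    A cell is kept if its score is at least `threshold` and no cell in its
    3x3 neighbourhood scores higher (ties are kept). Surviving cells are
    ranked by score and the k best are returned.
    """
    rows = len(heatmap)
    cols = len(heatmap[0]) if rows else 0
    peaks: List[Tuple[int, int, float]] = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the scores of the (up to 8) neighbouring cells.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    # Highest-confidence peaks first, then truncate to the query budget.
    peaks.sort(key=lambda p: p[2], reverse=True)
    return peaks[:k]


if __name__ == "__main__":
    toy_heatmap = [
        [0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.3, 0.0],
        [0.1, 0.3, 0.2, 0.6],
        [0.0, 0.0, 0.5, 0.7],
    ]
    for row, col, score in select_queries(toy_heatmap, k=2):
        print(row, col, score)
```

Restricting the transformer to K peak cells rather than every BEV cell is what keeps the query set compact; in a full detector, the features at each selected cell would presumably initialise the corresponding query embedding before refinement.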
