
Point Virtual Transformer

Veerain Sood, Bnalin, Gaurav Pandey

TL;DR

PointViT introduces a transformer-based 3D object detector that jointly reasons over raw LiDAR points and selectively sampled virtual points to address far-field sparsity. The method explores multiple fusion strategies, leverages a heatmap-driven proposal scheme, and uses a sparse 3D backbone to generate compact queries refined by cross-attention over voxel and point tokens. Key contributions include a vote-guided sampling pipeline, a dense alignment step for lifted proto-centers, and a context-aggregating transformer head that fuses multi-modal features with robust geometric supervision. Experiments on KITTI show strong 2D and competitive 3D/BEV performance, with notable gains in easy cases and insights into the trade-offs arising from depth-completion quality and fusion timing. Overall, PointViT demonstrates that selective virtual-point fusion, when integrated with heatmap-guided proposals and efficient attention, can enhance long-range perception while controlling computational cost.

Abstract

LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.
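
The query-initialisation step described above (a compact set of high-confidence queries drawn from the BEV representation) can be illustrated with a small sketch. It is not the authors' implementation. It assumes the BEV heatmap is a plain 2D grid of scores and uses a common peak-picking recipe (3x3 local maxima followed by top-k), and the names `select_queries`, `k`, and `threshold` are hypothetical.

```python
def select_queries(heatmap, k, threshold=0.1):
    """Return up to k (score, row, col) peaks from a 2D BEV heatmap.

    Assumption: `heatmap` is a list of equal-length lists of floats.
    A cell counts as a peak if it is at least as large as all of its
    8-connected neighbours, so cells on a flat plateau are all kept.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue  # skip low-confidence cells early
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((score, r, c))
    # Highest-scoring peaks first, then keep the compact top-k set.
    peaks.sort(key=lambda p: p[0], reverse=True)
    return peaks[:k]


if __name__ == "__main__":
    demo = [
        [0.0, 0.2, 0.0, 0.0],
        [0.2, 0.9, 0.2, 0.0],
        [0.0, 0.2, 0.0, 0.6],
        [0.0, 0.0, 0.0, 0.0],
    ]
    print(select_queries(demo, k=2))
```

In a full detector, the feature vector at each selected cell would serve as the initial query embedding. Here only the cell indices and scores are returned to keep the sketch self-contained.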

Paper Structure (36 sections, 5 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: BEV heatmaps generated using (a) LiDAR points fused with virtual points, and (b) raw LiDAR points only. In (a), the bottom region exhibits stronger activations, indicating enhanced feature representation and denser coverage in previously sparse areas. This improvement allows the model to detect objects more reliably and with greater spatial consistency. Green boxes denote correctly detected objects, cyan boxes indicate missed detections, and purple boxes represent additional detections enabled by the virtual points.
  • Figure 2: Pipeline. Real and virtual points are fused, voxelized, and encoded by a sparse 3D backbone to produce a heatmap. Vote-guided, query-aware sampling selects representative points (proto-centers) as query locations. From the lifted seeds, key and value pairs are extracted from nearby points and voxels, similar to b6. Query features are taken from the corresponding (x, y) cells of the densified heatmap.
  • Figure 3: Fused real + virtual LiDAR point cloud, where each point is colorized by its corresponding RGB value obtained by back-projection into the image plane using Eqs. (\ref{eq:proj})-(\ref{eq:virt}). The model's 3D detections (cars) are shown as green bounding boxes.
  • Figure 4: Visualization of fused LiDAR (white) and virtual points (blue). At longer ranges, virtual depth becomes irregular and misaligned with real surfaces, illustrating the non-uniform completion that leads to minor 3D localization drift.
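
The colorization mentioned in the Figure 3 caption can be sketched as follows. The paper's projection equations are not reproduced in this section, so the sketch assumes a standard pinhole camera model with points already expressed in the camera frame. The intrinsics `fx`, `fy`, `cx`, `cy` and the function names are hypothetical, and on KITTI the LiDAR-to-camera extrinsics would have to be applied first.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point (camera frame) to pixel coordinates (u, v).

    Returns None for points at or behind the camera plane (z <= 0).
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def colorize(points_cam, image, fx, fy, cx, cy):
    """Pair each point with the RGB value of the pixel it projects onto.

    Assumption: `image` is a list of rows, each a list of (r, g, b) tuples.
    Points that fall outside the image or behind the camera get None.
    """
    height, width = len(image), len(image[0])
    colored = []
    for point in points_cam:
        rgb = None
        uv = project_point(point, fx, fy, cx, cy)
        if uv is not None:
            # floor (not int) so slightly negative coordinates are rejected
            col, row = math.floor(uv[0]), math.floor(uv[1])
            if 0 <= col < width and 0 <= row < height:
                rgb = image[row][col]
        colored.append((point, rgb))
    return colored


if __name__ == "__main__":
    tiny_image = [
        [(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (255, 255, 255)],
    ]
    pts = [(0.0, 0.0, 1.0), (-1.0, -1.0, 1.0), (0.0, 0.0, -1.0)]
    for point, rgb in colorize(pts, tiny_image, fx=1.0, fy=1.0, cx=1.0, cy=1.0):
        print(point, rgb)
```

Nearest-pixel lookup is used for simplicity. Bilinear interpolation of the image would be a natural refinement.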