MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection

Jianan Li, Shaocong Dong, Lihe Ding, Tingfa Xu

TL;DR

This work tackles robust 3D object detection in large outdoor LiDAR scenes with objects at multiple scales. It introduces MsSVT++, a Mixed-scale Sparse Voxel Transformer that assembles multi-scale attention through head-group partitioning and scale-aware relative position encoding, with Chessboard Sampling and hash-based sparse operations to maintain efficiency. A Center Voting module fills object centers with voted voxels enriched by mixed-scale context, improving bounding-box localization, especially for large objects. Experiments on Waymo, KITTI, and Argoverse 2 show state-of-the-art performance for single-stage detectors and competitive results for multi-frame inputs, validating the approach and its generalizability across datasets and architectures.

Abstract

Accurate 3D object detection in large-scale outdoor scenes, characterized by considerable variations in object scales, necessitates features rich in both long-range and fine-grained information. While recent detectors have utilized window-based transformers to model long-range dependencies, they tend to overlook fine-grained details. To bridge this gap, we propose MsSVT++, an innovative Mixed-scale Sparse Voxel Transformer that simultaneously captures both types of information through a divide-and-conquer approach. This approach involves explicitly dividing attention heads into multiple groups, each responsible for attending to information within a specific range. The outputs of these groups are subsequently merged to obtain final mixed-scale features. To mitigate the computational complexity associated with applying a window-based transformer in 3D voxel space, we introduce a novel Chessboard Sampling strategy and implement voxel sampling and gathering operations sparsely using a hash map. Moreover, an important challenge stems from the observation that non-empty voxels are primarily located on the surface of objects, which impedes the accurate estimation of bounding boxes. To overcome this challenge, we introduce a Center Voting module that integrates newly voted voxels enriched with mixed-scale contextual information towards the centers of the objects, thereby improving precise object localization. Extensive experiments demonstrate that our single-stage detector, built upon the foundation of MsSVT++, consistently delivers exceptional performance across diverse datasets.
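
The divide-and-conquer idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`build_voxel_hash`, `gather_window`, `mixed_scale_attention`), the cubic window shape, and the toy data are assumptions made for illustration. Learned query/key/value projections, the scale-aware relative position encoding, and Chessboard Sampling are all omitted. A Python dictionary stands in for the hash map, so only non-empty voxels are ever gathered, and each head group attends to keys from one window size before the group outputs are concatenated.

```python
import itertools
import numpy as np

def build_voxel_hash(coords):
    """Map each non-empty voxel's integer (x, y, z) coordinate to its row index."""
    return {tuple(c): i for i, c in enumerate(coords.tolist())}

def gather_window(table, center, radius):
    """Return row indices of non-empty voxels in a cubic window around `center`.

    Simplification: every cell of the window is probed; empty cells are just misses.
    """
    span = range(-radius, radius + 1)
    hits = []
    for dx, dy, dz in itertools.product(span, span, span):
        idx = table.get((center[0] + dx, center[1] + dy, center[2] + dz))
        if idx is not None:
            hits.append(idx)
    return np.array(hits, dtype=np.int64)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mixed_scale_attention(queries, key_sets, num_heads):
    """Attend with head groups, one group per scale.

    queries:  (Nq, C) query features.
    key_sets: list of (Nk_g, C) key features, one non-empty set per window size.
    Heads are split evenly across the key sets; keys double as values here
    because the learned projections are left out of this sketch.
    """
    n_q, c = queries.shape
    n_groups = len(key_sets)
    assert num_heads % n_groups == 0 and c % num_heads == 0
    heads_per_group = num_heads // n_groups
    d = c // num_heads
    q = queries.reshape(n_q, num_heads, d)
    outputs = []
    for g, keys in enumerate(key_sets):
        k = keys.reshape(len(keys), num_heads, d)
        hs = slice(g * heads_per_group, (g + 1) * heads_per_group)
        logits = np.einsum("qhd,khd->hqk", q[:, hs], k[:, hs]) / np.sqrt(d)
        attn = softmax(logits, axis=-1)               # (Hg, Nq, Nk)
        out = np.einsum("hqk,khd->qhd", attn, k[:, hs])
        outputs.append(out.reshape(n_q, heads_per_group * d))
    # Merge the per-scale outputs back into a single C-dimensional feature.
    return np.concatenate(outputs, axis=1)

# Toy scene: four non-empty voxels with 8-channel features.
coords = np.array([[0, 0, 0], [1, 0, 0], [4, 0, 0], [9, 9, 0]])
feats = np.random.default_rng(0).normal(size=(4, 8))
table = build_voxel_hash(coords)
small = gather_window(table, (0, 0, 0), radius=1)   # fine-grained neighbourhood
large = gather_window(table, (0, 0, 0), radius=4)   # longer-range neighbourhood
mixed = mixed_scale_attention(feats[:1], [feats[small], feats[large]], num_heads=4)
```

In this sketch the first two heads see only the small window and the last two see only the large one, so the merged feature carries both scales without any head having to trade one off against the other.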

Paper Structure

This paper contains 30 sections, 10 equations, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Top: In contrast to sampling key voxels from (b) a single-scale 3D window in (a) raw point clouds, our MsSVT samples key voxels from (c) multi-scale windows, maintaining finer granularity on the target object while covering a large-scale neighborhood. Bottom: The different head groups in our model accept keys sampled from windows of varying scales and are individually responsible for obtaining (d) fine-grained details and (e) long-range context, as depicted by the higher attention weights. This collaborative effort enables accurate object detection.
  • Figure 2: Top: Architecture of our Mixed-scale Sparse Voxel Transformer backbone. The backbone network comprises $N$ MsSVT blocks. Bottom: Implementation details of the MsSVT block. Initially, we collect non-empty voxels within the query window and employ Chessboard Sampling (CBS) to sample the queries. For the keys, we gather non-empty voxels from key windows of various sizes individually, and generate multiple sets of keys using Balanced Multi-window Sampling. Each set represents information at a specific scale. To facilitate scale-aware attention learning, keys from windows of different sizes are assigned to distinct head groups, enabling us to capture both long-range context and fine-grained details concurrently.
  • Figure 3: Diagram of the Chessboard Sampling strategy.
  • Figure 4: (a) The overarching architecture of our detection framework. (b) Implementation details of the Object Center Voting module. This module comprises several sequential steps. Initially, it segments foreground voxels and predicts the Euclidean space offset for each foreground voxel center in relation to the object center. This predictive process generates a densely distributed vote point set in proximity to the object's center. Subsequently, the generated point set is re-voxelized, resulting in a new set of voxels. These new voxels are then enriched with mixed-scale context, incorporating information from diverse parts of the object via a mixed-scale attention mechanism. Finally, the new voxels are merged with the original voxels for further processing. Here, BMS stands for Balanced Multi-window Sampling, and SHA represents Scale-aware Head Attention. (c) Diagram depicting the 2D Region Proposal Network (RPN).
  • Figure 5: Qualitative results on Waymo. Ground truths and predictions are depicted by red and green boxes, respectively. Our method exhibits remarkable performance in (a) scenes that exceed a range of $50\,\mathrm{m}$ and (b) scenes featuring dense objects with significant scale variations, but occasionally encounters challenges in (c) scenes that encompass distant, isolated objects.
  • ...and 3 more figures
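
The voting and re-voxelization steps described in the Figure 4(b) caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's module: `vote_and_revoxelize` is a hypothetical name, the foreground mask and offsets are supplied by hand rather than predicted by a network, and the later steps (enriching the new voxels with mixed-scale attention and merging them with the original voxels) are left out.

```python
import numpy as np

def vote_and_revoxelize(centers, offsets, fg_mask, voxel_size):
    """Shift foreground voxel centers by their offsets and re-voxelize the votes.

    centers:    (N, 3) voxel centers in metres.
    offsets:    (N, 3) offsets toward the object center (hand-made here).
    fg_mask:    (N,) boolean foreground segmentation.
    voxel_size: voxel edge length in metres.
    Returns {integer voxel coordinate: mean vote position} for the voted voxels.
    """
    votes = centers[fg_mask] + offsets[fg_mask]
    coords = np.floor(votes / voxel_size).astype(np.int64)
    sums, counts = {}, {}
    # A dictionary keyed by integer coordinates groups votes that share a voxel.
    for coord, vote in zip(map(tuple, coords.tolist()), votes):
        sums[coord] = sums.get(coord, 0.0) + vote
        counts[coord] = counts.get(coord, 0) + 1
    return {coord: sums[coord] / counts[coord] for coord in sums}

# Four surface voxels around an object centered at (5, 5, 0.5), plus one background voxel.
object_center = np.array([5.0, 5.0, 0.5])
centers = np.array([
    [4.0, 5.0, 0.5],
    [6.0, 5.0, 0.5],
    [5.0, 4.0, 0.5],
    [5.0, 6.0, 0.5],
    [20.0, 20.0, 0.5],
])
fg_mask = np.array([True, True, True, True, False])
offsets = object_center - centers   # stand-in for predictions: perfect votes
new_voxels = vote_and_revoxelize(centers, offsets, fg_mask, voxel_size=1.0)
```

With these idealized offsets all four foreground votes fall into one previously empty voxel at the object center, which is the gap the caption says the module is meant to fill; the background voxel casts no vote.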