SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection

Hyeongseok Son, Jia He, Seung-In Park, Ying Min, Yunhao Zhang, ByungIn Yoo

TL;DR

SparseVoxFormer reframes multi-modal 3D object detection by directly leveraging high-resolution sparse 3D voxel features instead of BEV representations. It achieves explicit LiDAR-camera fusion by projecting voxel coordinates into image space and concatenating the matched features, and it uses sparse feature refinement with DSVT together with a token-elimination mechanism to maintain a fixed, low token count. The method demonstrates state-of-the-art results on nuScenes with lower computational cost and faster inference, owing to the combination of sparse voxel processing, deep fusion, and efficient token management. By exploiting the sparsity inherent in LiDAR data, SparseVoxFormer preserves rich 3D geometry, improves long-range detection, and offers a scalable, efficient multi-modal 3D perception solution for autonomous driving.

Abstract

Most previous 3D object detection methods that leverage the multi-modality of LiDAR and cameras use the Bird's Eye View (BEV) space for intermediate feature representation. However, this space uses a low x-y resolution and sacrifices z-axis information to reduce the overall feature resolution, which may reduce accuracy. To tackle the problem of low-resolution features, this paper focuses on the sparse nature of LiDAR point cloud data. We observe that the number of occupied cells in the 3D voxels constructed from LiDAR data can be even smaller than the total number of cells in the BEV map, despite the voxels' significantly higher resolution. Based on this, we introduce a novel sparse voxel-based transformer network for 3D object detection, dubbed SparseVoxFormer. Instead of performing BEV feature extraction, we directly use sparse voxel features as the input to a transformer-based detector. Moreover, for the camera modality, we introduce an explicit modality fusion approach that projects 3D voxel coordinates onto 2D images and collects the corresponding image features. Thanks to these components, our approach can leverage geometrically richer multi-modal features while also reducing the computational cost. Beyond the proof-of-concept level, we further focus on facilitating better multi-modal fusion and flexible control over the number of sparse features. Finally, thorough experimental results demonstrate that using a significantly smaller number of sparse features drastically reduces the computational cost of a 3D object detector while enhancing both overall and long-range performance.
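
The token-count argument in the abstract can be illustrated with a small counting sketch. The grid sizes ($180\times180$ for BEV, $180\times180\times11$ for voxels) are the ones quoted in the Figure 5 caption; everything else is an assumption made for illustration: the function name `count_occupied_voxels`, the detection range `PC_RANGE`, and the synthetic point cloud, which is not nuScenes data and does not reproduce the paper's statistics.

```python
import numpy as np


def count_occupied_voxels(points, pc_range, grid_shape):
    """Count distinct voxel cells that contain at least one point.

    points:     (N, 3) array of x, y, z coordinates.
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max).
    grid_shape: number of cells along x, y, z.
    """
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    grid = np.asarray(grid_shape, dtype=np.int64)

    # Drop points that fall outside the detection range.
    inside = np.all((points >= lo) & (points < hi), axis=1)
    pts = points[inside]

    # Map each point to its integer cell index along every axis.
    idx = np.floor((pts - lo) / (hi - lo) * grid).astype(np.int64)
    idx = np.clip(idx, 0, grid - 1)  # guard against float rounding at the border

    # Flatten the 3D index and count unique cells.
    flat = np.ravel_multi_index((idx[:, 0], idx[:, 1], idx[:, 2]), grid_shape)
    return int(np.unique(flat).size)


VOXEL_GRID = (180, 180, 11)          # voxel resolution from the Figure 5 caption
BEV_CELLS = 180 * 180                # dense BEV map: every cell becomes a token
DENSE_VOXEL_CELLS = 180 * 180 * 11   # size of the voxel grid if it were dense

# Assumed detection range in metres; not stated in this summary.
PC_RANGE = (-54.0, -54.0, -5.0, 54.0, 54.0, 3.0)

# Synthetic stand-in for a LiDAR sample: points concentrated near the ground.
rng = np.random.default_rng(0)
points = rng.normal(loc=0.0, scale=(20.0, 20.0, 0.5), size=(50_000, 3))

occupied = count_occupied_voxels(points, PC_RANGE, VOXEL_GRID)
print(f"dense voxel cells: {DENSE_VOXEL_CELLS}")
print(f"dense BEV cells:   {BEV_CELLS}")
print(f"occupied voxels:   {occupied}")
```

Only occupied cells become tokens in a sparse representation, so the relevant comparison is between the occupied-voxel count and the full BEV cell count, not between the two dense grid sizes.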

Paper Structure

This paper contains 38 sections, 8 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Architecture comparison between CMT [yan2023cross] and our SparseVoxFormer.
  • Figure 2: The key idea of using sparse voxel features instead of BEV features. Although BEV features are obtained from lower-resolution, z-axis-suppressed features, they can produce a number of tokens comparable to that of the higher-resolution voxel features.
  • Figure 3: Explicit multi-modal fusion without image depth is possible in our voxel-based approach because each valid cell in the 3D voxels already possesses the 3D coordinates required for the LiDAR-to-camera transformation. A LiDAR point can be projected into the camera space using a pre-defined LiDAR-camera transformation matrix. Similarly, the same transformation matrix gives each valid voxel feature a corresponding projected image feature.
  • Figure 4: Feature processing that produces BEV features in the LiDAR backbone of CMT [yan2023cross]. Our sparse voxel features are intermediate results of this process (in the 3D sparse encoder). The voxelization, voxel layer, and voxel encoder contain no learnable parameters.
  • Figure 5: Histogram of valid cell counts per LiDAR sample (10 sweeps) in the nuScenes train set, showing the distribution of the number of valid voxel features at a voxel resolution of $180\times180\times11$. The red arrow denotes the number of cells in a BEV feature map with a resolution of $180\times180$. The average value for the sparse voxel features (blue arrow) is much smaller than the number of BEV features (red arrow), and it is further reduced by our additional feature sparsification (green arrow).
  • ...and 2 more figures
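
The explicit fusion described in the Figure 3 caption (project each valid voxel into the image with the LiDAR-camera transformation, then attach the image feature found there) can be sketched as follows. This is a minimal single-camera illustration under stated assumptions, not the authors' implementation: the function names, the 4x4 `lidar2img` matrix layout, nearest-neighbour sampling, and zero-filling for voxels that fall outside the image are all hypothetical choices. Concatenating the matched features follows the TL;DR.

```python
import numpy as np


def project_voxels_to_image(voxel_xyz, lidar2img, img_hw):
    """Project voxel centres (N, 3) from the LiDAR frame to pixel coordinates.

    lidar2img: assumed (4, 4) matrix combining camera extrinsics and intrinsics.
    img_hw:    (height, width) of the image in pixels.
    Returns pixel coordinates (N, 2) as (u, v) and a boolean validity mask (N,).
    """
    n = voxel_xyz.shape[0]
    homo = np.concatenate([voxel_xyz, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = homo @ lidar2img.T                                     # (N, 4)

    depth = cam[:, 2]
    eps = 1e-6
    # Perspective divide; the clamp avoids division by zero for points
    # at or behind the camera, which are masked out below anyway.
    uv = cam[:, :2] / np.maximum(depth, eps)[:, None]

    h, w = img_hw
    valid = (
        (depth > eps)
        & (uv[:, 0] >= 0) & (uv[:, 0] < w)
        & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    )
    return uv, valid


def fuse_voxel_and_image_features(voxel_feats, voxel_xyz, img_feats, lidar2img, img_hw):
    """Concatenate each voxel feature with the image feature at its projection.

    voxel_feats: (N, Cv) sparse voxel features.
    img_feats:   (Ci, Hf, Wf) image feature map for an image of size img_hw.
    Returns an (N, Cv + Ci) array; voxels not visible in the image get zeros.
    """
    uv, valid = project_voxels_to_image(voxel_xyz, lidar2img, img_hw)
    _, hf, wf = img_feats.shape
    h, w = img_hw

    # Rescale pixel coordinates to feature-map coordinates (nearest neighbour).
    col = np.clip((uv[:, 0] * wf / w).astype(np.int64), 0, wf - 1)
    row = np.clip((uv[:, 1] * hf / h).astype(np.int64), 0, hf - 1)

    sampled = img_feats[:, row, col].T                # (N, Ci)
    sampled = np.where(valid[:, None], sampled, 0.0)  # zero out invisible voxels
    return np.concatenate([voxel_feats, sampled], axis=1)
```

Because every valid voxel already carries its own 3D coordinates, no per-pixel depth estimate is needed to decide where an image feature belongs in 3D. A multi-camera setup would repeat the projection per camera and merge the results, which this sketch leaves out.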