SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-based 3D Object Detection

Lin Liu, Ziying Song, Qiming Xia, Feiyang Jia, Caiyan Jia, Lei Yang, Hongyu Pan

TL;DR

SparseDet tackles the limitation of using single central voxels or clustered foreground proxies for LiDAR-based 3D detection by introducing sparse queries as object proxies. It simultaneously aggregates local multi-scale context via LMFA and global scene context via GFA, enabled by a KD-tree-based neighborhood fusion and scale-adaptive self-attention. The method delivers state-of-the-art performance among fully sparse detectors on nuScenes and KITTI while maintaining high FPS and modest parameter overhead. These results demonstrate that careful context aggregation within a fully sparse framework can significantly improve object proxy expressiveness and long-range detection in autonomous driving scenarios.

Abstract

LiDAR-based sparse 3D object detection plays a crucial role in autonomous driving applications due to its computational efficiency. Existing methods either use the features of a single central voxel as an object proxy, or treat an aggregated cluster of foreground points as an object proxy. However, the former cannot aggregate contextual information, so the object proxies carry insufficient information. The latter relies on multi-stage pipelines and auxiliary tasks, which reduce inference speed. To maintain the efficiency of the sparse framework while fully aggregating contextual information, we propose SparseDet, which designs sparse queries as object proxies. It introduces two key modules, the Local Multi-scale Feature Aggregation (LMFA) module and the Global Feature Aggregation (GFA) module, which together capture contextual information and thereby enhance the ability of the proxies to represent objects. The LMFA module fuses features across different scales for sparse key voxels via coordinate transformations and nearest-neighbor relationships, capturing object-level details and local contextual information. The GFA module uses self-attention to selectively aggregate the features of the key voxels across the entire scene, capturing scene-level contextual information. Experiments on nuScenes and KITTI demonstrate the effectiveness of our method. Specifically, on nuScenes, SparseDet surpasses the previous best sparse detector VoxelNeXt by 2.2\% mAP with 13.5 FPS, and on KITTI, it surpasses VoxelNeXt by 1.12\% $\mathbf{AP_{3D}}$ on hard level tasks with 17.9 FPS.
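
The global aggregation step that the abstract attributes to GFA (key-voxel queries attending over sparse voxel features from the whole scene) can be illustrated with plain scaled dot-product attention. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name, the single attention head, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical, and the scale-adaptive weighting used by SparseDet is left out.

```python
import numpy as np

def global_feature_aggregation(query_feats, voxel_feats, w_q, w_k, w_v):
    """Single-head attention: each query aggregates features from all voxels.

    query_feats: (Q, C) features of the key voxels used as queries (assumed layout).
    voxel_feats: (N, C) features of all sparse voxels in the scene.
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical parameters).
    Returns the aggregated queries (Q, D) and the attention weights (Q, N).
    """
    q = query_feats @ w_q
    k = voxel_feats @ w_k
    v = voxel_feats @ w_v
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v, attn
```

Because the attention runs only over occupied voxels, the cost scales with the number of queries times the number of sparse voxels rather than with the size of a dense BEV grid.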

Paper Structure

This paper contains 28 sections, 7 equations, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of SparseDet with existing LiDAR-based detectors on the nuScenes test set, where the vertical axis represents mAP and the horizontal axis represents model inference speed (FPS). Compared to other sparse detectors, SparseDet achieves the highest mAP while maintaining an excellent inference speed.
  • Figure 2: Comparison between SparseDet and other sparse detection frameworks (FSD, FSDV2, VoxelNeXt, SAFDNet). (a) The first category of methods, VoxelNeXt and SAFDNet, diffuses features by stacking convolutional layers to fill center voxels. However, using only a single central voxel feature as the proxy for an object neglects much of the instance's features, which weakens the object representation. (b) The second category of methods, FSD and FSDV2, uses a voting mechanism to aggregate foreground points into object-centered clusters for further prediction. However, these methods rely heavily on point segmentation and prediction refinement, which adds latency. (c) Our SparseDet addresses the insufficient information representation of central voxel features by using sparse queries as object proxies and selectively aggregating sparse voxel features at positions of interest, avoiding the need for additional auxiliary tasks.
  • Figure 3: The framework of our SparseDet. First, we voxelize the point clouds and feed the voxels into a 3D sparse convolution backbone. Then, we apply high compression (following VoxelNeXt) to the sparse voxel features ($F_{s_{4}}$, $F_{s_{5}}$, $F_{s_{6}}$) of the last three layers of the 3D backbone to obtain the 2D sparse features of these layers, denoted as $F_{s_{4}}^{2D}$, $F_{s_{5}}^{2D}$ and $F_{s_{6}}^{2D}$. In LMFA, we concatenate $F_{s_{4}}$, $F_{s_{5}}$ and $F_{s_{6}}$ to obtain $F_{Fusion}$. Applying high compression to $F_{Fusion}$ yields $F_{Fusion}^{2D}$, on which we predict key voxel positions using a heatmap. Through coordinate transformation, we map the key voxels to the spaces of $F_{s_{4}}$, $F_{s_{5}}$ and $F_{s_{6}}$ and aggregate the neighborhood voxel context based on $K$ Nearest Neighbor (KNN) relationships. Subsequently, we write the aggregated voxel features back into $F_{Fusion}$ according to their indices to strengthen the sparse feature representation. In the GFA module, we use sparse voxels as queries to adaptively aggregate global sparse voxel features, where a scale-adaptive weight map allows the queries to learn the receptive field for aggregating information from relevant positions. Finally, the aggregated queries are fed into an FFN for result prediction. Adaption Fusion denotes the adaptive fusion of multi-scale features, and FFN is a Feed Forward Neural Network.
  • Figure 4: The architecture details of LMFA. The LMFA module predicts a heatmap from $F_{Fusion}^{2D}$ and selects the top-$N_{key}$ highest-scoring voxels as key voxels. It then converts the positions of the key voxels to their original feature spaces according to the downsampling ratios. Based on these positions, the LMFA module finds the KNN voxel features and fuses them using $Conv_{1\times 1}$. Finally, the neighborhood features from the multi-scale feature spaces are fed into the adaptive-fusion module for adaptive fusion.
  • Figure 5: The architecture of Adaptive Fusion, which adaptively weights and fuses neighborhood features ($F_{Neighbor^{'}}^{s_{4}}$, $F_{Neighbor^{'}}^{s_{5}}$ and $F_{Neighbor^{'}}^{s_{6}}$) from multi-scale feature spaces for each key voxel.
  • ...and 1 more figures
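
The LMFA steps described in the Figure 3 to Figure 5 captions (top-$N_{key}$ selection on the heatmap, coordinate conversion by downsampling ratio, KNN neighborhood aggregation, and adaptive fusion across scales) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. All function names are hypothetical. Mean pooling stands in for the $1\times 1$ convolution fusion, a softmax-weighted sum stands in for the adaptive-fusion module, and a brute-force distance search stands in for the KD-tree lookup mentioned in the TL;DR.

```python
import numpy as np

def select_key_voxels(heatmap_scores, coords_2d, n_key):
    """Pick the n_key highest-scoring sparse voxels from a heatmap.

    heatmap_scores: (N,) score per occupied voxel; coords_2d: (N, 2) voxel indices.
    Returns the selected coordinates and their indices into the input.
    """
    n_key = min(n_key, heatmap_scores.shape[0])
    top = np.argsort(-heatmap_scores, kind="stable")[:n_key]
    return coords_2d[top], top

def to_scale(coords, src_stride, dst_stride):
    """Map voxel indices from a map with stride src_stride to one with dst_stride."""
    return coords.astype(np.float64) * (src_stride / dst_stride)

def knn_aggregate(query_xy, voxel_xy, voxel_feats, k):
    """Mean-pool the features of the k nearest voxels for each query.

    Brute-force search; a KD-tree would return the same neighbors faster.
    """
    k = min(k, voxel_xy.shape[0])
    d2 = ((query_xy[:, None, :] - voxel_xy[None, :, :]) ** 2).sum(axis=-1)
    nn_idx = np.argsort(d2, axis=1, kind="stable")[:, :k]
    return voxel_feats[nn_idx].mean(axis=1), nn_idx

def adaptive_fuse(per_scale_feats, scale_logits):
    """Softmax-weighted sum over scales (assumed form of the fusion).

    per_scale_feats: (S, N, C) neighborhood features per scale and key voxel.
    scale_logits: (N, S) unnormalized per-voxel scale weights.
    """
    w = np.exp(scale_logits - scale_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("ns,snc->nc", w, per_scale_feats)
```

A typical call sequence would select key voxels on the fused map, call `to_scale` once per backbone level (for example from a stride of 8 to a stride of 16; both values are made up for illustration), run `knn_aggregate` on each level, and pass the stacked results to `adaptive_fuse`.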