
3D Small Object Detection with Dynamic Spatial Pruning

Xiuwei Xu, Zhihao Sun, Ziwei Wang, Hongmin Liu, Jie Zhou, Jiwen Lu

TL;DR

This paper tackles the challenge of 3D small object detection in indoor scenes where high-resolution features boost accuracy but incur prohibitive costs. It introduces DSPDet3D, a multi-level 3D detector that employs dynamic spatial pruning (DSP) to selectively prune decoder voxel features after object proposals are formed, guided by a theoretically derived pruning mask that preserves predictions. DSPDet3D combines a four-stage DSP decoder, a light-weight MLP pruning module supervised by a masking loss, partial addition fusion, and a training-time weak pruning mechanism to stabilize optimization and robustly assign positive proposals within a $P\times P\times P$ cube around object centers. Empirically, it achieves leading small-object AP on ScanNet and TO-SCENE, maintains competitive overall AP with superior memory efficiency, and generalizes to very large scenes (e.g., Matterport3D) where prior methods falter, enabling faster, scalable 3D perception on standard GPUs. The approach advances practical indoor 3D detection by balancing high-resolution representation with computational efficiency, facilitating real-time robotic and AR/VR applications.

Abstract

In this paper, we propose an efficient feature pruning strategy for 3D small object detection. Conventional 3D object detection methods struggle on small objects because the few points on each object carry only weak geometric information. Although increasing the spatial resolution of feature representations can improve detection performance on small objects, the additional computational overhead is unaffordable. Through an in-depth study, we observe that the growth in computation mainly comes from the upsampling operation in the decoder of the 3D detector. Motivated by this, we present a multi-level 3D detector named DSPDet3D, which benefits from high spatial resolution to achieve high accuracy on small object detection while reducing redundant computation by focusing only on small-object areas. Specifically, we theoretically derive a dynamic spatial pruning (DSP) strategy that prunes the redundant spatial representation of a 3D scene in a cascade manner according to the distribution of objects. We then design a DSP module following this strategy and construct DSPDet3D with this efficient module. On the ScanNet and TO-SCENE datasets, our method achieves leading performance on small object detection. Moreover, DSPDet3D trained only on ScanNet rooms generalizes well to larger-scale scenes. It takes less than 2s to directly process a whole building consisting of more than 4500k points while detecting almost all objects, ranging from cups to beds, on a single RTX 3090 GPU. Project page: https://xuxw98.github.io/DSPDet3D/.

Paper Structure (15 sections, 8 equations, 10 figures, 7 tables)


Figures (10)

  • Figure 2: Detection accuracy (mAP@0.25 over all categories) and speed (FPS) of mainstream 3D object detection methods on the TO-SCENE dataset. Our DSPDet3D shows an absolute advantage on 3D small object detection and provides a flexible accuracy-speed tradeoff by simply adjusting the pruning threshold, without retraining.
  • Figure 3: Comparison of the decoders in a typical multi-level 3D object detector (rukhovich2023tr3d) and our DSPDet3D. Note that the sparsity of voxels in the decoder changes due to the generative upsampling operation. After detecting objects at a level, DSPDet3D prunes redundant voxel features according to the distribution of objects before each upsampling operation. Red boxes indicate all pruned voxels and 'scissor' boxes indicate voxels pruned in the previous layer. $\{O\}$ is the set of all objects and $\{O_i\}$ is the set of objects assigned to level $i$.
  • Figure 4: The memory footprint distribution of different multi-level detectors. Layer 4 to Layer 1 refer to decoder layers (including detection heads) from coarse to fine. Doubling the spatial resolution of TR3D improves performance on 3D small object detection from 52.7% to 62.8%, but the memory footprint increases dramatically. We find that the decoder layers account for most of the cost. DSPDet3D efficiently reduces redundant computation on these layers, achieving both fast speed and high accuracy.
  • Figure 5: Illustration of DSPDet3D. The voxelized point clouds are fed into a high-resolution sparse convolutional backbone, which outputs four levels of scene representations. Four dynamic spatial pruning (DSP) modules are stacked to construct a multi-level decoder and detect objects from coarse to fine. Each DSP module uses a light-weight learnable module to predict the pruning mask. During inference, we discretize the pruning mask and use it to guide pruning before generative upsampling. During training, we instead interpolate the pruning mask to the next level and prune the voxel features after generative upsampling.
  • Figure 6: Visualization of the pruning process on ScanNet. We show the kept voxels at each level under different thresholds. The memory footprint of each level is also listed at the bottom.
  • ...and 5 more figures
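
The inference-time behavior described in the captions of Figures 2 and 5 (discretize the pruning mask with a threshold, prune, then apply generative upsampling) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `prune_voxels` and `generative_upsample`, the convention that a higher score means "keep", the 2×2×2 child expansion per voxel, and the example scores are all assumptions made for illustration, and plain Python tuples stand in for sparse voxel tensors.

```python
from itertools import product

def prune_voxels(coords, keep_scores, threshold):
    """Discretize the mask: keep voxels whose score reaches the threshold.

    Assumption: a higher score means the voxel should be kept.
    """
    return [c for c, s in zip(coords, keep_scores) if s >= threshold]

def generative_upsample(coords):
    """Toy generative upsampling: each coarse voxel (x, y, z) spawns
    its 2x2x2 block of children at the next, finer level (assumed stride 2)."""
    children = set()
    for x, y, z in coords:
        for dx, dy, dz in product((0, 1), repeat=3):
            children.add((2 * x + dx, 2 * y + dy, 2 * z + dz))
    return sorted(children)

# Four coarse voxels with hypothetical scores from a pruning module.
coarse = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
scores = [0.9, 0.2, 0.6, 0.05]

for tau in (0.0, 0.5, 0.8):
    kept = prune_voxels(coarse, scores, tau)
    fine = generative_upsample(kept)
    print(f"threshold={tau}: kept {len(kept)} coarse voxels -> {len(fine)} fine voxels")
```

In this toy setting, raising the threshold shrinks the set of voxels that reach the next level, which is the kind of accuracy-speed knob the Figure 2 caption describes as adjustable without retraining. Pruning before upsampling means the discarded voxels never generate children at all.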