Real-Time 3D Object Detection with Inference-Aligned Learning

Chenyu Zhao, Xianwei Zheng, Zimin Xia, Linwei Yue, Nan Xue

TL;DR

This work tackles the training–inference gap in real-time indoor 3D object detection from point clouds by introducing SR3D, a framework that couples Spatial-Prioritized Optimal Transport Assignment (SPOTA) with Rank-aware Adaptive Self-Distillation (RAS). SPOTA emphasizes spatial reliability by using a normalized vertex distance and a center prior: it redefines the label assignment cost to focus on geometry rather than semantics and selects the top-$k$ candidates as positives. RAS injects ranking information into training through a localization-aware self-distillation loss and adaptive weighting that aligns classification confidence with localization quality, addressing the rank sensitivity of evaluation metrics such as AP. With both components, SR3D improves accuracy on ScanNet V2 and SUN RGB-D while preserving real-time inference speed, supporting the practicality of inference-aligned learning for dense 3D detectors and suggesting a path toward outdoor datasets and further speedups.
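To make the geometry-driven selection concrete, here is a minimal sketch of top-$k$ positive selection by a purely geometric cost. It is an illustration under stated assumptions, not the paper's formulation: the exact definition of the normalized vertex distance is not given in this summary, so the version below (mean squared corner distance divided by the squared diagonal of the smallest enclosing box) is assumed, boxes are taken to be axis-aligned, and the optimal transport step and center prior are left out. The names `corners`, `normalized_vertex_distance`, and `topk_positives` are hypothetical.

```python
import numpy as np


def corners(box):
    """Return the 8 corners of an axis-aligned box given as (cx, cy, cz, w, l, h)."""
    center, half = box[:3], box[3:] / 2.0
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    )
    return center + signs * half


def normalized_vertex_distance(pred, gt):
    """ASSUMED form: mean squared distance between corresponding corners,
    divided by the squared diagonal of the smallest box enclosing both boxes."""
    pred_c, gt_c = corners(pred), corners(gt)
    both = np.vstack([pred_c, gt_c])
    diag_sq = np.sum((both.max(axis=0) - both.min(axis=0)) ** 2)
    return np.mean(np.sum((pred_c - gt_c) ** 2, axis=1)) / diag_sq


def topk_positives(preds, gt, k):
    """Pick the k predictions with the lowest geometric cost for one ground truth."""
    costs = np.array([normalized_vertex_distance(p, gt) for p in preds])
    return np.argsort(costs)[:k], costs


# Toy example: one ground-truth box and three candidate predictions.
gt = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0])
preds = np.array([
    [0.0, 0.0, 0.0, 2.0, 2.0, 2.0],  # perfect match
    [0.5, 0.0, 0.0, 2.0, 2.0, 2.0],  # slightly shifted
    [3.0, 0.0, 0.0, 2.0, 2.0, 2.0],  # far away
])
positive_idx, costs = topk_positives(preds, gt, k=2)
print(positive_idx, costs)
```

In this toy case the two closest predictions are chosen as positives regardless of any classification score, which is the sense in which the cost is driven by geometry rather than semantics.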

Abstract

Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used at inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address this limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.

Paper Structure

This paper contains 35 sections, 10 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: AP$_{25}$ vs. latency on the ScanNet V2 validation set. Our proposed SR3D achieves accurate and fast detection from indoor point clouds. Latency is measured on a single RTX 4090 GPU. The metric AP$_{25}$ is the mean Average Precision at an IoU threshold of 0.25.
  • Figure 2: Illustration of core limitations in current dense 3D detectors. (a) Fixed heuristic label assignment misidentifies high-quality anchors for the desk, being misled by spatial clutter in the indoor scene. (b) Rank-agnostic supervision leads to incorrect ranking of chair predictions, degrading performance under Average Precision (AP) evaluation. We use 2D boxes for simplicity.
  • Figure 3: The overall framework of our Spatial-prioritized and Rank-aware network for indoor 3D object detection (SR3D). The spatial-prioritized OTA (SPOTA) and rank-aware adaptive self-distillation (RAS) scheme are employed only during training. SPOTA dynamically assigns positive labels to those truly informative and high-reliability anchors by leveraging geometry hints from prediction–ground truth pairs, such as the IoU and normalized vertex distances. RAS introduces ranking perception into training via a self-distillation mechanism and adaptively reweights the supervision based on relative ranking signals.
  • Figure 4: A simplified illustration of normalized vertex distances $\mathcal{R}_{VD}$. Dashed box indicates the smallest enclosing box. We use 2D boxes for simplicity.
  • Figure 5: Qualitative results on the ScanNet V2 validation set. We only visualize the most confident predictions. Compared with TR3D, our SR3D enables robust detection of more challenging objects in cluttered scenes. Different classes are indicated by bounding boxes in different colors.
  • ...and 5 more figures
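
The rank sensitivity of AP mentioned in the Figure 2(b) caption can be illustrated with a small self-contained example. This is a generic sketch of Average Precision for a single class (all-point interpolation with a monotone precision envelope), not the evaluation code used in the paper; the function name `average_precision` and the toy scores are assumptions. True/false-positive flags are taken as given, whereas in practice they come from IoU matching at the chosen threshold.

```python
def average_precision(scores, is_true_positive, num_gt):
    """AP for one class: area under the precision-recall curve after
    sorting predictions by descending confidence."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = 0
    precisions, recalls = [], []
    for rank, i in enumerate(order, start=1):
        tp += is_true_positive[i]
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Monotone envelope: precision at a recall level is the best
    # precision achieved at that recall level or any higher one.
    for j in range(len(precisions) - 2, -1, -1):
        precisions[j] = max(precisions[j], precisions[j + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Same three boxes in both cases: two true positives and one false positive.
flags = [1, 1, 0]

# Case A: confidence agrees with localization quality (false positive ranked last).
ap_aligned = average_precision([0.9, 0.8, 0.3], flags, num_gt=2)

# Case B: the poorly localized box receives the highest confidence.
ap_misranked = average_precision([0.8, 0.3, 0.9], flags, num_gt=2)

print(ap_aligned, ap_misranked)
```

The set of detections is identical in both cases; only the confidence ordering changes, yet AP drops from 1.0 to about 0.67. This is why supervision that ignores ranking can hurt AP even when the boxes themselves are good.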