Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision

Maoji Zheng, Ziyu Xu, Qiming Xia, Hai Wu, Chenglu Wen, Cheng Wang

TL;DR

Seg2Box tackles the redundancy between bounding-box and semantic-label supervision by training detectors with only semantic annotations. It introduces MFMS-C to generate high-quality box-level pseudo-labels via multi-frame, multi-radius clustering and selects the best proposals with the MSF-Score, followed by SGIM-ST to iteratively refine labels and mine unlabeled instances through semantic-guided self-training. On Waymo Open and nuScenes, Seg2Box achieves substantial gains (e.g., mAP improvements of $23.7\%$ and $10.3\%$, respectively) and approaches 95% of fully supervised Vehicle AP at IoU $=0.5$, demonstrating strong label-efficient potential. The two-stage framework and its components enable robust cross-task supervision, suggesting practical pathways for reducing annotation costs in 3D scene understanding.

Abstract

LiDAR-based 3D object detection and semantic segmentation are critical tasks in 3D scene understanding. Traditional detection and segmentation methods supervise their models with bounding-box labels and semantic mask labels, respectively. However, these two independent sets of labels inherently contain significant redundancy. This paper aims to eliminate that redundancy by supervising 3D object detection using only semantic labels. The difficulty is that point-cloud instances have incomplete geometric structure and ambiguous boundaries, which leads to inaccurate pseudo-labels and poor detection results. To address these challenges, we propose a novel method named Seg2Box. We first introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages the spatio-temporal consistency of point clouds to generate accurate box-level pseudo-labels. Additionally, the Semantic-Guiding Iterative-Mining Self-Training (SGIM-ST) module is proposed to enhance performance by progressively refining the pseudo-labels and mining the instances for which no pseudo-labels were generated. Experiments on the Waymo Open Dataset and the nuScenes Dataset show that our method significantly outperforms other competitive methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the strong label-efficient potential of our method.
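To make the clustering-then-selection idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single frame (no multi-frame aggregation), 2D bird's-eye-view points of one semantic class, axis-aligned boxes, and a simple connected-components grouping in place of DBSCAN. The function `placeholder_score` is a hypothetical stand-in for the MSF-Score, whose definition is not given here; all function names, the class template size, and the thresholds are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def cluster_by_radius(points, radius):
    """Group points into connected components; points within `radius` are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= radius:
            parent[find(i)] = find(j)

    clusters = {}
    for i, p in enumerate(points):
        clusters.setdefault(find(i), []).append(p)
    return list(clusters.values())


def bounding_box(cluster):
    """Axis-aligned box (xmin, ymin, xmax, ymax) around a cluster."""
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return (min(xs), min(ys), max(xs), max(ys))


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def placeholder_score(box, template):
    """HYPOTHETICAL stand-in for the MSF-Score: closeness to a class size template."""
    w, l = box[2] - box[0], box[3] - box[1]
    return 1.0 / (1.0 + abs(w - template[0]) + abs(l - template[1]))


def propose_and_select(points, radii, template, iou_thresh=0.1, min_points=2):
    """Cluster at several radii, then keep the best non-overlapping boxes (NMS)."""
    proposals = []
    for r in radii:
        for c in cluster_by_radius(points, r):
            if len(c) >= min_points:
                proposals.append(bounding_box(c))
    proposals = list(dict.fromkeys(proposals))  # drop duplicates, keep order
    proposals.sort(key=lambda b: placeholder_score(b, template), reverse=True)
    kept = []
    for b in proposals:
        if all(iou(b, k) <= iou_thresh for k in kept):
            kept.append(b)
    return kept


# Two adjacent "vehicles" sampled on a 1 m grid, separated by a 1.5 m gap.
car_a = [(float(x), float(y)) for x in range(3) for y in range(5)]
car_b = [(x + 3.5, y) for x, y in car_a]
kept = propose_and_select(car_a + car_b, radii=[0.5, 1.0, 2.0], template=(2.0, 4.0))
```

In this toy scene the largest radius merges the two adjacent objects into one box, while the middle radius separates them. Scoring every proposal and suppressing overlapping lower-scored ones is what lets the separated boxes win, which mirrors the motivation for generating proposals at multiple scales rather than committing to a single clustering radius.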

Paper Structure

This paper contains 31 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Our method uses only semantic annotations to train a 3D object detector. It consists of a Pseudo-label Generation stage and a Self-train Loop-improvement stage.
  • Figure 2: The challenges of generating box-level pseudo-labels from semantic labels. Blue boxes are the ground truth, and black boxes are pseudo-labels generated by direct DBSCAN clustering. Points with different colors indicate different categories of objects. ①, ②: Incomplete objects due to sparse point clouds and occlusion. ③: One instance clustered into multiple because the object is truncated. ④: Multiple instances clustered into one because the objects are adjacent.
  • Figure 3: Illustration of the Seg2Box framework. (a) MFMS-C generates box-level pseudo-labels from semantic points to train the initial detector $\mathcal{F}_{0}^{det}$. To address the challenges that incomplete geometric structure and boundary ambiguity pose for pseudo-label generation, MFMS-C first generates numerous box proposals in consecutive frames using MSC. NMS Selection then retains the high-quality proposals according to MSF-Scoring, which measures the quality of pseudo-labels. (b) SGIM-ST enhances detection performance by iteratively mining the instances that were missed during annotation and refining the pseudo-labels through SCF, STCF, BAF, and the MSF-Weighted Loss.
  • Figure 4: Meta Shape and Fitting Score (MSF-Score).
  • Figure 5: (a-c): The IoU distribution between pseudo-labels and ground truth. (d-f): The mean absolute errors (MAEs) for size, position, and angle of pseudo-labels.
  • ...and 1 more figure
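The error statistics named in the Figure 5 caption (MAEs for size, position, and angle) can be illustrated with a short sketch. This is an assumption-laden example rather than the paper's evaluation code: it assumes each box is a tuple `(cx, cy, w, l, yaw)`, that pseudo-labels have already been matched to ground-truth boxes, that position error is the Euclidean distance between centers, that size error is the mean absolute difference of width and length, and that headings differing by π count as the same orientation. The function names are hypothetical.

```python
from math import hypot, pi


def angle_error(a, b):
    """Smallest absolute heading difference, treating yaw and yaw + pi as equal."""
    d = abs(a - b) % pi
    return min(d, pi - d)


def pseudo_label_maes(pairs):
    """Mean absolute errors over matched (pseudo, ground_truth) box pairs.

    Each box is (cx, cy, w, l, yaw); this layout is an assumption of the sketch.
    """
    n = len(pairs)
    position = sum(hypot(p[0] - g[0], p[1] - g[1]) for p, g in pairs) / n
    size = sum((abs(p[2] - g[2]) + abs(p[3] - g[3])) / 2 for p, g in pairs) / n
    angle = sum(angle_error(p[4], g[4]) for p, g in pairs) / n
    return {"position": position, "size": size, "angle": angle}


# One matched pair: center offset of (3, 4) m, length off by 1 m, heading flipped by pi.
maes = pseudo_label_maes([((0.0, 0.0, 2.0, 4.0, 0.0), (3.0, 4.0, 2.0, 5.0, pi))])
```

Wrapping the heading difference matters because a box rotated by π occupies the same footprint, so a naive subtraction would report a large angle error for a geometrically identical box.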