
Hardness-Aware Scene Synthesis for Semi-Supervised 3D Object Detection

Shuai Zeng, Wenzhao Zheng, Jiwen Lu, Haibin Yan

TL;DR

This paper tackles the data-hungry nature of 3D object detection by introducing hardness-aware scene synthesis (HASS), which maintains an online pseudo-database of pseudo-labeled objects and synthesizes diverse scenes by composing unlabeled foreground objects with labeled backgrounds. The approach has two synthesis stages (easy and hard) and a dynamic pseudo-database that gradually shifts from high to low filtering thresholds and from sparse to dense object insertion, so the training data becomes progressively harder. Key contributions include using pseudo-labels from a trained teacher for scene synthesis, a sparse-to-dense synthesis strategy, and extensive ablations showing improved generalization on KITTI and Waymo with limited labels, all without extra inference overhead. The work demonstrates that carefully controlled synthetic scene generation, guided by pseudo-label quality and a curriculum-like increase in density, can improve semi-supervised 3D detection performance in autonomous driving settings.
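The two-stage curriculum described above can be sketched as a small schedule object. This is only an illustrative sketch, not the authors' implementation: the class name `HardnessSchedule`, the linear interpolation, and every numeric default (thresholds, object counts, epochs) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class HardnessSchedule:
    """Hypothetical curriculum: all defaults below are illustrative assumptions."""
    start_threshold: float = 0.9   # strict pseudo-label filter at the start of the hard stage
    end_threshold: float = 0.6     # looser filter at the end of training
    min_objects: int = 2           # sparse pseudo-object insertion at first
    max_objects: int = 10          # dense insertion at the end
    hard_start_epoch: int = 20     # epochs before this belong to the easy stage
    total_epochs: int = 80

    def stage(self, epoch: int) -> str:
        """Easy stage uses ground-truth objects only; hard stage adds pseudo-labels."""
        return "easy" if epoch < self.hard_start_epoch else "hard"

    def progress(self, epoch: int) -> float:
        """Fraction of the hard stage completed, clamped to [0, 1]."""
        if epoch < self.hard_start_epoch:
            return 0.0
        span = max(1, self.total_epochs - 1 - self.hard_start_epoch)
        return min(1.0, (epoch - self.hard_start_epoch) / span)

    def threshold(self, epoch: int) -> float:
        """Confidence threshold, moving linearly from high to low."""
        p = self.progress(epoch)
        return self.start_threshold + p * (self.end_threshold - self.start_threshold)

    def num_pseudo_objects(self, epoch: int) -> int:
        """Number of pseudo-objects to insert, moving from sparse to dense."""
        p = self.progress(epoch)
        return round(self.min_objects + p * (self.max_objects - self.min_objects))


schedule = HardnessSchedule()
for epoch in (0, 20, 50, 79):
    print(epoch, schedule.stage(epoch),
          round(schedule.threshold(epoch), 3),
          schedule.num_pseudo_objects(epoch))
```

A linear ramp is the simplest choice for illustration; any monotone schedule would express the same high-to-low threshold and sparse-to-dense idea.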

Abstract

3D object detection aims to recover the 3D information of objects of interest and is a fundamental task of autonomous driving perception. Its performance depends heavily on the scale of labeled training data, yet high-quality annotations for point cloud data are costly to obtain. While conventional methods focus on generating pseudo-labels for unlabeled samples as supplementary training data, the structural nature of 3D point cloud data makes it possible to compose objects and backgrounds into realistic synthetic scenes. Motivated by this, we propose a hardness-aware scene synthesis (HASS) method that generates adaptive synthetic scenes to improve the generalization of detection models. We obtain pseudo-labels for unlabeled objects and generate diverse scenes with different compositions of objects and backgrounds. Because scene synthesis is sensitive to the quality of pseudo-labels, we further propose a hardness-aware strategy to reduce the effect of low-quality pseudo-labels, and we maintain a dynamic pseudo-database to ensure the diversity and quality of synthetic scenes. Extensive experimental results on the widely used KITTI and Waymo datasets demonstrate the superiority of the proposed HASS method, which outperforms existing semi-supervised learning methods on 3D object detection. Code: https://github.com/wzzheng/HASS.
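As a rough illustration of the dynamic pseudo-database mentioned in the abstract, the sketch below keeps, per unlabeled frame, only the teacher predictions whose confidence passes the current threshold, and overwrites older entries when the frame is revisited. The names `PseudoObject` and `PseudoDatabase`, the per-frame keying, and the replacement policy are all assumptions made for this example; the paper's actual update rule may differ.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PseudoObject:
    label: str
    box: tuple    # assumed layout: (x, y, z, length, width, height, yaw)
    score: float  # teacher confidence


@dataclass
class PseudoDatabase:
    # frame_id -> list of accepted pseudo-objects (hypothetical structure)
    entries: dict = field(default_factory=dict)

    def update(self, frame_id, predictions, threshold):
        """Replace the frame's entries with predictions scoring >= threshold."""
        kept = [p for p in predictions if p.score >= threshold]
        if kept:
            self.entries[frame_id] = kept
        else:
            # Assumed policy: drop the frame if nothing passes the filter.
            self.entries.pop(frame_id, None)
        return len(kept)

    def sample(self, k, rng):
        """Draw up to k pseudo-objects uniformly from the whole database."""
        pool = [obj for objs in self.entries.values() for obj in objs]
        return rng.sample(pool, min(k, len(pool)))


db = PseudoDatabase()
preds = [
    PseudoObject("Car", (10.0, 2.0, -1.0, 4.0, 1.8, 1.5, 0.0), 0.95),
    PseudoObject("Pedestrian", (5.0, -3.0, -1.0, 0.8, 0.6, 1.7, 0.0), 0.70),
    PseudoObject("Cyclist", (8.0, 4.0, -1.0, 1.8, 0.6, 1.7, 0.0), 0.40),
]
print(db.update("frame_0", preds, threshold=0.9))  # strict filter early on
print(db.update("frame_0", preds, threshold=0.6))  # looser filter later
print([o.label for o in db.sample(5, random.Random(0))])
```

Lowering the threshold over time admits harder, lower-confidence objects into the pool, which is the sense in which the database is hardness-aware in this sketch.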

Paper Structure (16 sections, 4 equations, 9 figures, 6 tables)


Figures (9)

  • Figure 1: An illustration of constructing a synthetic 2D image sample and a 3D LiDAR sample. We compare the difficulty of constructing a 2D image and a 3D scene with objects from unlabeled data. Synthesizing a realistic 2D image is difficult, whereas unlabeled objects can easily be placed at different positions to generate diverse 3D samples and extend the laser beam distribution.
  • Figure 2: Overview of the HASS framework. The proposed architecture consists of two stages: (a) Easy Synthesis and (b) Hard Synthesis. We use blue arrows and light red dashed arrows to represent the easy-synthesis stage and the hard-synthesis stage, respectively. In the easy-synthesis stage, we synthesize only ground truth objects (light blue arrow); the database is generated offline before training and contains only ground truth. As training proceeds, the model becomes more tolerant of hard pseudo-labels. In the hard-synthesis stage, we maintain a pseudo-database and update it with the appropriate pseudo-labels (light pink dashed arrow). The green arrow represents the dense synthesis strategy for the easy-synthesis stage, and the red dashed arrow represents the sparse-to-dense synthesis strategy for the hard-synthesis stage.
  • Figure 3: An illustration of (a) the proposed Scene Synthesis and the (b) Flip and (c) PointCutMix data augmentation methods. (a): Blue bounding boxes represent existing ground truth boxes, and red bounding boxes represent synthetic objects. Scene synthesis places foreground pseudo-labeled objects from the pseudo-database into labeled point clouds, so the synthetic scenes contain abundant unseen foreground information. We avoid object collisions so that the synthesized scenes are valid. (b): Flip augmentation simply flips the point cloud to obtain similar data. (c): PointCutMix replaces parts of a scene's point cloud with point clouds from other scenes. Point clouds of different colors represent data from different scenes.
  • Figure 4: Visualization of detections from different models. Blue boxes represent ground truth boxes and red boxes represent pseudo-labels. We visualize the detections of teacher models from different epochs on the same sample. (a) shows the low recall of the original teacher model at the beginning of training. The orange circle marks a false pseudo-label: a pedestrian misidentified as a cyclist. (b) shows that the recall of the trained teacher model is higher than that of the original model. The trained model predicts hard samples better than the original model.
  • Figure 5: The visualization of the relationship between IoU and confidence of pseudo-labels.
  • ...and 4 more figures
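
The collision avoidance mentioned in the Figure 3(a) caption can be sketched as a greedy placement loop. This is a deliberately simplified sketch under stated assumptions: boxes are reduced to axis-aligned bird's-eye-view footprints `(cx, cy, length, width)` with yaw ignored, and the function names `bev_overlap` and `synthesize_scene` are hypothetical. A real pipeline would typically test rotated boxes and also copy the object's points into the scene.

```python
def bev_overlap(a, b):
    """Return True if two axis-aligned BEV footprints (cx, cy, length, width) overlap."""
    ax, ay, al, aw = a
    bx, by, bl, bw = b
    return abs(ax - bx) < (al + bl) / 2 and abs(ay - by) < (aw + bw) / 2


def synthesize_scene(existing_boxes, candidates, max_insert):
    """Greedily insert candidate boxes that collide with nothing already placed.

    existing_boxes: ground truth footprints already in the labeled scene
    candidates:     footprints sampled from the (pseudo-)database
    max_insert:     cap on inserted objects (sparse when small, dense when large)
    """
    placed = list(existing_boxes)
    inserted = []
    for box in candidates:
        if len(inserted) >= max_insert:
            break
        if any(bev_overlap(box, other) for other in placed):
            continue  # skip candidates that would collide
        placed.append(box)
        inserted.append(box)
    return inserted


ground_truth = [(0.0, 0.0, 4.0, 2.0)]
sampled = [
    (1.0, 0.0, 4.0, 2.0),    # collides with the ground truth box
    (10.0, 0.0, 4.0, 2.0),   # free
    (10.5, 0.5, 4.0, 2.0),   # collides with the previously inserted box
    (20.0, 5.0, 4.0, 2.0),   # free
]
print(synthesize_scene(ground_truth, sampled, max_insert=5))
```

Varying `max_insert` over training is one way to express the sparse-to-dense strategy within this sketch.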