Table of Contents
Fetching ...

Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation

Song Wang, Jiawei Yu, Wentong Li, Wenyu Liu, Xiaolu Liu, Junbo Chen, Jianke Zhu

TL;DR

This paper tackles the inefficiency and uneven difficulty of 3D semantic scene completion by introducing HASSC, a hardness-aware training framework. It combines global hardness-based hard voxel mining with local geometric anisotropy to refine challenging voxels, and augments training with a self-distillation strategy using a mean-teacher model to stabilize learning. The approach yields consistent accuracy gains across state-of-the-art camera-based SSC baselines on SemanticKITTI while keeping inference cost unchanged, and shows robustness through extensive ablations. The findings suggest that targeted voxel-wise refinement guided by hardness signals can substantially improve 3D semantic occupancy predictions for autonomous driving, with potential future gains from integrating richer geometric cues such as NeRF-based cues.

Abstract

Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles, which attracts the increasing attention of both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel equally in 3D space during training. As the hard voxels have not been paid enough attention, the performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels, which are easy to learn but require amounts of computation due to handling all the voxels uniformly for the existing models. Furthermore, the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper, we propose HASSC approach to train the semantic scene completion model with hardness-aware design. The global hardness from the network optimization process is defined for dynamical hard voxel selection. Then, the local hardness with geometric anisotropy is adopted for voxel-wise refinement. Besides, self-distillation strategy is introduced to make training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively promote the accuracy of the baseline model without incurring the extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC.

Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation

TL;DR

This paper tackles the inefficiency and uneven difficulty of 3D semantic scene completion by introducing HASSC, a hardness-aware training framework. It combines global hardness-based hard voxel mining with local geometric anisotropy to refine challenging voxels, and augments training with a self-distillation strategy using a mean-teacher model to stabilize learning. The approach yields consistent accuracy gains across state-of-the-art camera-based SSC baselines on SemanticKITTI while keeping inference cost unchanged, and shows robustness through extensive ablations. The findings suggest that targeted voxel-wise refinement guided by hardness signals can substantially improve 3D semantic occupancy predictions for autonomous driving, with potential future gains from integrating richer geometric cues such as NeRF-based cues.

Abstract

Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles, which attracts the increasing attention of both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel equally in 3D space during training. As the hard voxels have not been paid enough attention, the performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels, which are easy to learn but require amounts of computation due to handling all the voxels uniformly for the existing models. Furthermore, the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper, we propose HASSC approach to train the semantic scene completion model with hardness-aware design. The global hardness from the network optimization process is defined for dynamical hard voxel selection. Then, the local hardness with geometric anisotropy is adopted for voxel-wise refinement. Besides, self-distillation strategy is introduced to make training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively promote the accuracy of the baseline model without incurring the extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC.
Paper Structure (19 sections, 10 equations, 10 figures, 15 tables)

This paper contains 19 sections, 10 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Comparing our proposed hardness-aware semantic scene completion (HASSC) approach against previous semantic scene completion methods. We present an effective hard voxel mining (HVM) head with self-distillation during training.
  • Figure 2: Overview of Hardness-Aware Semantic Scene Completion (HASSC) pipeline. We take the camera images as input and construct 3D feature volume by Camera Encoder and 2D-to-3D Transform. With the fine-grained features provided by 3D Backbone, we propose hard voxel mining (HVM) head to make the model concentrate on hard voxels. The teacher-model has the same architecture as student-model, which is updated by the exponential moving average (EMA) of student. The stable predictions can be achieved by taking advantage of both the self-distillation and HVM head.
  • Figure 3: Illustration of Hard Voxel Mining (HVM) Head. At training stage, $N$ hard voxels are selected with respect to their global hardness and random sampling. Then, we re-sample the corresponding fine-grained features and employ MLP Layer to refine their predictions, which are supervised by the ground truth and local hardness. For inference, we directly utilize Trilinear Interpolation to obtain final prediction.
  • Figure 4: Illustration of Local Geometric Anisotropy (LGA). The upper figure gives the examples of different LGA values, which are all from real scenarios. The lower figure shows the distribution of LGA values in SemanticKITTI behley2019semantickitti. We use dark and light colors to represent the proportion of non-empty and empty voxels in each LGA value category, respectively.
  • Figure 5: Visual results of our method (HASSC-VoxFormer-T) and the state-of-the-art camera-based methods on the validation set of SemanticKITTI. The left shows the perspective view image from left camera, which is the input for model training and inference. The right is the ground truth and the corresponding predicted semantic scene from these methods.
  • ...and 5 more figures