SHTOcc: Effective 3D Occupancy Prediction with Sparse Head and Tail Voxels

Qiucheng Yu, Yuan Xie, Xin Tan

TL;DR

The paper tackles inefficiencies and bias in vision-based 3D occupancy prediction by identifying inter-class long-tail and geometric distribution patterns in voxel space. It introduces SHTOcc, which combines sparse head-tail voxel construction with attention-guided head voxel selection and robust tail voxel sampling, plus a decoupled decoder with label smoothing to reduce head-class bias and boost tail-class accuracy. Empirical results across SemanticKITTI SSC, nuScenes-Occupancy, Occ3D-nuScenes, and LiDAR segmentation show substantial memory and latency reductions (e.g., up to ~58.6% faster inference and ~42.2% memory savings) and consistent mIoU gains (~0.2–0.7 points) when integrating SHTOcc with popular backbones. The approach is plug-and-play and offers practical improvements for real-time 3D perception in autonomous driving.

Abstract

3D occupancy prediction has attracted much attention in the field of autonomous driving due to its powerful geometric perception and object recognition capabilities. However, existing methods have not explored the most essential distribution patterns of voxels, resulting in unsatisfactory results. This paper first explores the inter-class distribution and geometric distribution of voxels, thereby solving the long-tail problem caused by the inter-class distribution and the poor performance caused by the geometric distribution. Specifically, this paper proposes SHTOcc (Sparse Head-Tail Occupancy), which uses sparse head-tail voxel construction to accurately identify and balance key voxels in the head and tail classes, while using decoupled learning to reduce the model's bias towards the dominant (head) category and enhance the focus on the tail class. Experiments show that significant improvements have been made on multiple baselines: SHTOcc reduces GPU memory usage by 42.2%, increases inference speed by 58.6%, and improves accuracy by about 7%, verifying its effectiveness and efficiency. The code is available at https://github.com/ge95net/SHTOcc
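
The TL;DR mentions label smoothing in the decoupled decoder as one way to reduce head-class bias. The sketch below illustrates generic label-smoothed cross-entropy for a single voxel in plain Python. It is not taken from the SHTOcc code base: the function names, the uniform smoothing scheme, and the default `eps=0.1` are assumptions chosen for illustration, and the paper's actual formulation may differ.

```python
import math

def smooth_labels(label, num_classes, eps=0.1):
    """Soft target: mix the one-hot label with a uniform distribution (assumed scheme)."""
    off = eps / num_classes
    return [(1.0 - eps) + off if c == label else off for c in range(num_classes)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of per-class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def smoothed_cross_entropy(logits, label, eps=0.1):
    """Cross-entropy between the smoothed target and the predicted distribution."""
    targets = smooth_labels(label, len(logits), eps)
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))

# Hypothetical voxel with 4 classes whose ground-truth class index is 2.
loss = smoothed_cross_entropy([2.0, 0.5, 1.0, -1.0], label=2)
```

With `eps=0` this reduces to ordinary cross-entropy. A larger `eps` penalizes over-confident predictions, which is the general mechanism by which smoothing can temper a classifier's bias toward frequent classes.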

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of SHTOcc with various 3D semantic scene completion methods on the SemanticKITTI (Behley et al., 2019) dataset.
  • Figure 2: Overview of SHTOcc. Image features are first extracted by the image backbone and then converted to 3D encoded voxels through a 2D-to-3D transformation. The encoded voxels pass through a dual path to obtain sparse head and tail voxels. The extraction is a coarse-to-fine process: each additional layer extracts more voxels. In the second learning phase of decoupled training, only the parameters of the segmentation head are updated.
  • Figure 3: Visualization of SparseOcc (Tang et al., 2024) and SHTOcc. The figure shows that SHTOcc achieves more accurate predictions at lower computational cost.