MinkOcc: Towards real-time label-efficient semantic occupancy prediction

Samuel Sze, Daniele De Martini, Lars Kunze

TL;DR

MinkOcc tackles the high annotation burden of 3D semantic occupancy prediction with a two-stage semi-supervised pipeline: training is first warm-started on a small set of dense 3D annotations and then continued using accumulated LiDAR sweeps and 2D pseudo-labels from vision foundation models. The approach combines a fully sparse multi-modal backbone built on Minkowski Engine, a differentiable spherical renderer (Pulsar) for 2D supervision, and a two-phase loss design that shifts from dense 3D supervision to 2D pseudo-label supervision, while keeping inference real-time. Key contributions include (i) a scalable, sparse, multi-modal 3D semantic occupancy model capable of real-time performance, (ii) a semi-supervised training strategy that substantially reduces the need for dense 3D labels, and (iii) effective integration of 2D pseudo-labels and LiDAR accumulation to supervise both occupancy and semantics. The work argues that semi-supervised learning can enable practical deployment of 3D semantic occupancy prediction in autonomous driving beyond curated datasets, maintaining competitive accuracy at significantly reduced labeling and computation cost.

Abstract

Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor- and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that is trained with a two-step semi-supervised procedure. Here, a small dataset of explicit 3D annotations warm-starts the training process; supervision then continues with simpler-to-annotate accumulated LiDAR sweeps and images, with the images semantically labelled by vision foundation models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.
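
The two-step procedure can be illustrated with a small data-split sketch. This is not the authors' code: it assumes the warm-start fraction (written `alpha` here, 10% in the paper's setup) is taken over training samples, and the function names, the phase names, and the supervision-signal names are all hypothetical. Whether the second phase reuses the warm-start samples, and whether further loss terms are active during warm-start, is not specified here, so the sketch simply returns the two disjoint subsets.

```python
import random


def split_for_two_phase_training(sample_ids, alpha=0.10, seed=0):
    """Split samples into a densely labelled warm-start subset and the rest.

    Assumption: `alpha` is the fraction of samples that keep dense 3D labels.
    The remaining samples would be supervised without dense 3D annotations.
    """
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_warm = max(1, round(alpha * len(ids)))
    return ids[:n_warm], ids[n_warm:]


def active_supervision(phase):
    """Return the (hypothetically named) supervision signals used in a phase."""
    if phase == "warm_start":
        # Phase 1: dense 3D semantic annotations.
        return ("dense_3d_semantic",)
    if phase == "semi":
        # Phase 2: accumulated LiDAR sweeps plus rendered 2D pseudo-labels.
        return ("lidar_accumulated_occupancy", "rendered_2d_pseudo_label")
    raise ValueError(f"unknown phase: {phase!r}")
```

With 1,000 samples and `alpha=0.10`, the first subset holds 100 samples that need dense 3D labels, which is where the 90% reduction in manual labeling comes from under this reading.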

Paper Structure

This paper contains 19 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: mIoU of occupancy prediction methods using different supervision signals on Occ3D-nuScenes [occ3d]. Strong supervision yields the highest mIoU, but its labeling cost is prohibitive. Self-supervision avoids labels but suffers from low accuracy. Semi-supervision, where our proposed MinkOcc belongs, offers a practical alternative that reduces labeling costs while maintaining accuracy.
  • Figure 2: Overview of the system pipeline. Our model predicts dense 3D semantic occupancy maps from LiDAR and camera information. It is trained in two steps. First, we warm-start the prediction model with $\alpha = 10\%$ of the dense 3D semantic annotations from Occ3D-nuScenes; then, the voxel semantic prediction branch is turned off, and cheaper accumulated LiDAR sweeps and image semantic maps replace the dense annotations. Image supervision is provided through a differentiable rendering approach, which projects the semantic information into the camera frames. 3D sparse and dense representations are converted to Coordinate List (COO) format to ensure compatibility with Minkowski Engine [choy20194d].
  • Figure 3: Comparison of MinkOcc-semi's predicted 3D feature volume, rendered in 2D, against the 2D pseudo-labels. Results are from the Occ3D-nuScenes validation set. Better viewed in color and zoomed in.
  • Figure 4: Qualitative results on the Occ3D-nuScenes validation set for MinkOcc with different warm-start percentages, compared against FB-Occ [nvocc] and the ground truth. Better viewed in color and zoomed in.
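
The COO conversion mentioned in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names and the choice of label `0` for free space are assumptions. The batch index is prepended as the first coordinate column because, to the best of our knowledge, that is the layout Minkowski Engine expects for sparse tensor coordinates.

```python
import numpy as np

FREE_LABEL = 0  # hypothetical: the label value that marks an empty voxel


def dense_to_coo(grid, batch_index=0, free_label=FREE_LABEL):
    """Convert a dense (X, Y, Z) label grid to COO coordinates and labels.

    Returns an (N, 4) int32 array of [batch, x, y, z] rows and an (N,) array
    holding the label of each occupied voxel, in matching row-major order.
    """
    occupied = grid != free_label
    coords = np.argwhere(occupied).astype(np.int32)  # (N, 3) voxel indices
    labels = grid[occupied].astype(np.int64)         # (N,) same ordering
    batch_col = np.full((coords.shape[0], 1), batch_index, dtype=np.int32)
    return np.hstack([batch_col, coords]), labels


def coo_to_dense(coords, labels, shape, free_label=FREE_LABEL):
    """Scatter COO entries back into a dense grid (batch column is ignored)."""
    grid = np.full(shape, free_label, dtype=labels.dtype)
    grid[coords[:, 1], coords[:, 2], coords[:, 3]] = labels
    return grid
```

Only occupied voxels are stored, so memory scales with the number of occupied cells rather than with the full grid volume, which is the property sparse convolution backbones rely on.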