OccLE: Label-Efficient 3D Semantic Occupancy Prediction

Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Guosheng Lin

TL;DR

This paper tackles the high cost of voxel-level labeling in 3D semantic occupancy prediction by proposing OccLE, a label-efficient framework that decouples semantic and geometric learning and then fuses their features. It distills a 2D foundation model into aligned pseudo labels that supervise both 2D and 3D semantic learning, adopts a cross-plane, semi-supervised geometry module, and uses Dual Mamba fusion with a scatter-accumulated projection to supervise unannotated regions. Its three core contributions are 2D pseudo-label distillation for semantic learning, cross-plane image-LiDAR geometry synergy with semi-supervision, and semantic-geometric fusion with aligned pseudo-label supervision. Experiments on SemanticKITTI and Occ3D-nuScenes show OccLE achieving 16.59% and 27.53% mIoU, respectively, with only 10% of voxel annotations, which indicates strong label efficiency and competitiveness with fully supervised approaches and is relevant to scalable 3D perception in autonomous systems.
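The mIoU figures quoted above are means of per-class intersection-over-union scores. As a minimal sketch of that metric (not the authors' evaluation code), the function below computes it over flat lists of voxel labels. The name `mean_iou`, the `ignore_index` value of 255, and the choice to skip classes that never occur are assumptions made for illustration; the benchmarks typically accumulate counts over a whole validation set and apply their own class conventions.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in the prediction or the target.

    pred and target are flat sequences of integer class ids, one per voxel.
    Voxels whose target equals ignore_index (an assumed convention) are skipped.
    """
    inter = [0] * num_classes   # true positives per class
    union = [0] * num_classes   # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1       # false positive for the predicted class
            union[t] += 1       # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: five voxels, three classes, last voxel unlabeled.
score = mean_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 255], num_classes=3)
```

In the toy example, class 0 scores 1/2 and class 1 scores 2/3, while class 2 never occurs among the labeled voxels and is left out of the mean.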

Abstract

3D semantic occupancy prediction offers intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a label-efficient 3D semantic occupancy prediction method that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Accordingly, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherent characteristics, employing semi-supervision to enhance geometry learning. We fuse the semantic and geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated predictions with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets. The code will be publicly released at https://github.com/NerdFNY/OccLE
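The abstract does not spell out how the scatter-accumulated projection works. One plausible reading, sketched below purely as an assumption, is that per-voxel class scores are projected onto the image plane and the scores of all voxels landing on the same pixel are summed, so that the resulting 2D map can be compared against 2D pseudo labels. The pinhole camera model, the function name `scatter_accumulate`, and all parameters are hypothetical and are not taken from the paper.

```python
import math


def scatter_accumulate(voxel_scores, voxel_centers, fx, fy, cx, cy, height, width):
    """Sum per-voxel class scores into the pixels their centers project to.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy) and voxel
    centers given as (x, y, z) in the camera frame. Illustrative only.
    """
    num_classes = len(voxel_scores[0])
    image = [[[0.0] * num_classes for _ in range(width)] for _ in range(height)]
    for scores, (x, y, z) in zip(voxel_scores, voxel_centers):
        if z <= 0:                       # voxel is behind the camera
            continue
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            for c, s in enumerate(scores):
                image[v][u][c] += s      # accumulate along the viewing ray
    return image


# Two voxels on the same ray and one behind the camera.
scores = [[0.2, 0.8], [0.5, 0.5], [1.0, 0.0]]
centers = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
projected = scatter_accumulate(scores, centers, 1.0, 1.0, 1.0, 1.0, 3, 3)
```

Here the first two voxels fall on pixel (1, 1) and their scores add up, while the third is discarded. A practical version would likely run as a batched scatter-add on tensors and might weight voxels by depth or occupancy; those details are left open here.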

Paper Structure

This paper contains 34 sections, 6 equations, 6 figures, 16 tables, 1 algorithm.

Figures (6)

  • Figure 1: Label-efficient 3D semantic occupancy prediction aims to achieve high performance using limited voxel annotations and aligned pseudo labels. We propose OccLE, a novel learning paradigm that decouples semantic and geometric learning and fuses their feature grids for the final prediction.
  • Figure 2: Overview of OccLE. First, we distill 2D foundation models to predict aligned pseudo labels for supervising 2D and 3D semantic learning. Next, we propose cross-plane image and LiDAR feature synergy and apply semi-supervision to learn geometry. Finally, we fuse semantic and geometric feature grids via Dual Mamba and supervise the unannotated predictions with aligned pseudo labels using scatter-accumulated projection.
  • Figure 3: Illustration of geometry learning. (a) Frontal view feature comparison. (b) BEV view feature comparison. (c) The cross-plane image and LiDAR feature synergy.
  • Figure 4: Qualitative results on the SemanticKITTI validation set. OccF., VoxF., SGN, Sym., and HASSC denote the predictions of OccFormer (zhang2023occformer), VoxFormer (li2023voxformer), SGN (mei2024camera), Symphonies (jiang2024symphonize), and HASSC (wang2024not), respectively. GT denotes the ground truth.
  • Figure A1: The detailed structure of Dual Mamba. It comprises four stages and employs two parallel branches to process the inputs at each stage.
  • ...and 1 more figure