OccLE: Label-Efficient 3D Semantic Occupancy Prediction
Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Guosheng Lin
TL;DR
This paper tackles the high cost of voxel-level labeling in 3D semantic occupancy prediction by proposing OccLE, a label-efficient framework that takes images and LiDAR as inputs, decouples semantic and geometric learning, and fuses the resulting feature grids. It distills a 2D foundation model into aligned pseudo labels that supervise both 2D and 3D semantic learning, adopts a cross-plane, semi-supervised module that combines image and LiDAR inputs for geometry learning, and fuses semantic and geometric features with Dual Mamba, using a scatter-accumulated projection to supervise unannotated predictions with the aligned pseudo labels. Its three core contributions are 2D pseudo-label distillation for semantic learning, cross-plane image-LiDAR synergy with semi-supervision for geometry learning, and semantic-geometric fusion with aligned pseudo-label supervision. With only 10% of voxel annotations, OccLE achieves mIoU of 16.59 on SemanticKITTI and 27.53 on Occ3D-nuScenes, which is competitive with fully supervised approaches and suggests that annotation cost for 3D perception in autonomous driving can be reduced substantially.
Abstract
3D semantic occupancy prediction offers intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches rely either on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a label-efficient 3D semantic occupancy prediction framework that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Accordingly, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs through cross-plane synergy that exploits the inherent properties of each modality, and employs semi-supervision to enhance geometry learning. We fuse the semantic and geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated predictions with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets. The code will be publicly released at https://github.com/NerdFNY/OccLE
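The abstract does not specify how the scatter-accumulated projection is implemented. The NumPy sketch below shows one plausible reading and is not the authors' code. It assumes a pinhole camera, voxel centers already expressed in camera coordinates, per-pixel averaging as the accumulation rule, and a cross-entropy loss against 2D pseudo labels. The names `scatter_accumulate_projection` and `pseudo_label_loss`, and all parameters, are hypothetical.

```python
import numpy as np

def scatter_accumulate_projection(voxel_probs, voxel_centers, K, image_hw):
    """Project per-voxel class probabilities onto the image plane (illustrative only).

    voxel_probs:   (N, C) class probabilities for N voxels.
    voxel_centers: (N, 3) voxel centers in camera coordinates (z points forward).
    K:             (3, 3) pinhole intrinsics (assumed camera model).
    image_hw:      (H, W) image size.
    Returns an (H, W, C) map of per-pixel averaged probabilities and an
    (H, W) boolean mask of pixels that received at least one voxel.
    """
    H, W = image_hw
    C = voxel_probs.shape[1]

    # Keep only voxels in front of the camera.
    in_front = voxel_centers[:, 2] > 1e-6
    pts, probs = voxel_centers[in_front], voxel_probs[in_front]

    # Perspective projection to integer pixel coordinates.
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # Discard projections that fall outside the image.
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, probs = u[inside], v[inside], probs[inside]

    # Scatter-accumulate: np.add.at sums correctly even when several
    # voxels land on the same pixel (plain fancy-index += would not).
    acc = np.zeros((H, W, C))
    count = np.zeros((H, W))
    np.add.at(acc, (v, u), probs)
    np.add.at(count, (v, u), 1.0)

    # Average the accumulated probabilities on pixels that were hit.
    hit = count > 0
    acc[hit] /= count[hit][:, None]
    return acc, hit

def pseudo_label_loss(projected, hit, pseudo_labels, eps=1e-8):
    """Cross-entropy between projected probabilities and 2D pseudo labels,
    evaluated only on pixels covered by at least one voxel."""
    if not hit.any():
        return 0.0
    p = projected[hit]        # (M, C)
    y = pseudo_labels[hit]    # (M,)
    return float(-np.log(p[np.arange(len(y)), y] + eps).mean())
```

In this reading, voxels without 3D annotations still receive a training signal, because their projected predictions are compared with pseudo labels from the 2D foundation model. A trainable version would use a differentiable scatter operation in a deep learning framework rather than NumPy.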
