Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance

Duc-Hai Pham, Duc-Dung Nguyen, Anh Pham, Tuan Ho, Phong Nguyen, Khoi Nguyen, Rang Nguyen

TL;DR

This work addresses the high annotation burden of 3D semantic scene completion (SSC) by introducing a semi-supervised framework that harnesses 2D vision foundation models to extract 3D cues from unlabeled images. The VFG-SSC network generates 3D clues from temporal 2D data, and a lightweight, attention-based enhancement module fuses these clues with 3D features to produce high-quality pseudo-labels for unlabeled data. Across outdoor and indoor benchmarks, the approach achieves up to 85% of fully supervised performance using only 10% of the labeled data, and it generalizes across multiple SSC backbones, offering a practical, scalable path toward camera-based 3D occupancy prediction. The method reduces annotation costs while maintaining strong 3D geometry and semantic completion, with potential broad impact on autonomous perception pipelines.

Abstract

Accurate prediction of 3D semantic occupancy from 2D visual images is vital in enabling autonomous agents to comprehend their surroundings for planning and navigation. State-of-the-art methods typically employ fully supervised approaches, necessitating a huge labeled dataset acquired through expensive LiDAR sensors and meticulous voxel-wise labeling by human annotators. The resource-intensive nature of this annotating process significantly hampers the application and scalability of these methods. We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data. Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues, facilitating a more efficient training process. Our framework exhibits notable properties: (1) Generalizability, applicable to various 3D semantic scene completion approaches, including 2D-3D lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated through experiments on SemanticKITTI and NYUv2, wherein our method achieves up to 85% of the fully-supervised performance using only 10% labeled data. This approach not only reduces the cost and labor associated with data annotation but also demonstrates the potential for broader adoption in camera-based systems for 3D semantic occupancy prediction.
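
The "85% of the fully-supervised performance" figure is a ratio between two scores. As a minimal sketch of how such a ratio could be computed from voxel-wise mIoU, the following uses hypothetical helper names, an assumed ignore label of 255 for unannotated voxels, and an assumed convention that class 0 means "empty" and is skipped in the mean. None of these details are taken from the paper.

```python
from typing import List, Optional, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int],
                  num_classes: int, ignore_label: int = 255) -> List[Optional[float]]:
    """Per-class intersection-over-union on flattened voxel labels.

    Assumes every prediction is a valid class id in [0, num_classes).
    Voxels whose target equals ignore_label (assumed value) are skipped.
    Returns None for classes that never occur in pred or target.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    return [inter[c] / union[c] if union[c] else None for c in range(num_classes)]


def mean_iou(ious: Sequence[Optional[float]], skip_classes: Sequence[int] = (0,)) -> float:
    """Average IoU over classes that occur, skipping e.g. the 'empty' class."""
    vals = [v for c, v in enumerate(ious) if c not in skip_classes and v is not None]
    return sum(vals) / len(vals) if vals else 0.0


def relative_performance(semi_miou: float, full_miou: float) -> float:
    """Semi-supervised mIoU as a percentage of the fully supervised mIoU."""
    return 100.0 * semi_miou / full_miou


if __name__ == "__main__":
    # Toy labels, made up purely for illustration.
    target = [0, 1, 1, 2, 2, 255]
    pred = [0, 1, 2, 2, 2, 1]
    ious = per_class_iou(pred, target, num_classes=3)
    print(ious, mean_iou(ious))
```

With this definition, a semi-supervised model would reach 85% relative performance whenever its mIoU is 0.85 times that of the fully supervised model trained on all labels.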

Paper Structure (16 sections, 13 figures, 14 tables)

This paper contains 16 sections, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Visual comparison on SemanticKITTI: VFG-SSC surpasses self-supervised methods like SelfOcc by 4% mIoU and outperforms the semi-supervised methods Self-Training and Mean Teacher by 1.5% mIoU.
  • Figure 2: The overall architecture of the proposed VFG-SSC network. Our approach leverages 3D cues from 2D foundation models to enhance the inferred 3D feature volume and generate the final 3D semantic occupancy grid. The model is trained using all frames (solid red box) to produce pseudo-labels for the unlabeled data.
  • Figure 3: Qualitative comparison of Semi-SSC approaches. With Sup-only and Self-Training, the scene layout is reasonably reconstructed; however, dynamic object prediction is incorrect due to many false positives. With 3D clues as pseudo-labels, objects are correctly predicted, but occluded regions receive no prediction (they are assigned as empty). Our method obtains reliable predictions (high precision) and reasonable reconstruction in occluded space (high recall).
  • Figure 4: 3D clues visual comparisons on different settings: using only the current image and accumulated 3D clues from temporal frames with and without filtering.
  • Figure 5: Qualitative results on the validation set of the SemanticKITTI dataset.
  • ...and 8 more figures
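
The Figure 4 caption contrasts 3D clues taken from the current image alone with clues accumulated over temporal frames, with and without filtering. One way to picture that accumulation is the sketch below. It assumes a pinhole camera, known camera-to-world poses, and per-pixel depth and semantic labels already predicted by 2D models. The majority-vote rule with a `min_hits` threshold is an illustrative stand-in for filtering, not the paper's actual procedure, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Back-project a pixel with known depth into camera coordinates (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def apply_pose(point, rotation, translation) -> Tuple[float, ...]:
    """Map a camera-frame point into the shared world frame: R @ p + t."""
    x, y, z = point
    return tuple(rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z
                 + translation[i] for i in range(3))


def voxel_index(point, origin, voxel_size: float) -> Voxel:
    """Integer voxel coordinates of a world-frame point."""
    return tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))


def accumulate_clues(frames: List[dict], intrinsics, origin,
                     voxel_size: float, min_hits: int = 1) -> Dict[Voxel, int]:
    """Vote per-pixel labels from several frames into one sparse voxel grid.

    Each frame is a dict with 'rotation' (3x3), 'translation' (3,) and
    'pixels', a list of (u, v, depth, label) tuples. A voxel keeps its
    majority label only if that label received at least min_hits votes
    (assumed filtering rule); min_hits=1 disables the filtering.
    """
    votes: Dict[Voxel, Counter] = {}
    for frame in frames:
        for u, v, depth, label in frame["pixels"]:
            if depth <= 0:  # skip invalid depth estimates
                continue
            world = apply_pose(unproject(u, v, depth, *intrinsics),
                               frame["rotation"], frame["translation"])
            votes.setdefault(voxel_index(world, origin, voxel_size), Counter())[label] += 1
    clues: Dict[Voxel, int] = {}
    for idx, counter in votes.items():
        label, hits = counter.most_common(1)[0]
        if hits >= min_hits:
            clues[idx] = label
    return clues
```

Voxels that no pixel ever lands in are simply absent from the returned dictionary, which matches the Figure 3 observation that clues used directly as pseudo-labels leave occluded regions without a prediction.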