VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion

Zaidao Han, Risa Higashita, Jiang Liu

TL;DR

This work tackles monocular 3D Semantic Scene Completion by explicitly separating visible-region perception from occluded-region reasoning. It introduces VRLE to provide clean, offline supervision for visible voxels and a dual-decoder architecture (VD for visible priors and OD for full-scene completion) that interact bidirectionally to produce coherent 3D semantics and geometry. The Visible Embedding Feature Constructor (VEFC) and multi-level positional encoding strengthen cross-modal grounding, enabling precise 3D lifting from 2D features and robust voxel interaction. Experiments on SemanticKITTI and SSCBench-KITTI-360 demonstrate state-of-the-art geometric completion and semantic accuracy with competitive efficiency, highlighting the practical impact for autonomous driving and robotic scene understanding.

Abstract

Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI-360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.
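
To make the idea of extracting visible-region supervision from dense ground truth concrete, the following is a minimal sketch and not the authors' VRLE procedure, whose details are not given here. It assumes a pinhole camera at the grid's camera-frame origin, a label grid expressed in camera coordinates (x right, y down, z forward), label `0` for empty space and `255` for ignored voxels. The function name `extract_visible_labels`, the half-voxel depth tolerance, and the choice to leave free-space voxels unsupervised are all hypothetical. Occupied voxel centers are projected into the image, and a per-pixel depth test keeps only the nearest surface.

```python
import numpy as np

EMPTY = 0     # assumed label for free space
IGNORE = 255  # assumed label for voxels excluded from supervision


def extract_visible_labels(labels, voxel_size, origin, K, image_hw):
    """Return a label grid that keeps only voxels passing a per-pixel depth test.

    labels:     (X, Y, Z) integer grid of dense ground-truth classes.
    voxel_size: edge length of one voxel in metres.
    origin:     (3,) camera-frame position of the grid's minimum corner.
    K:          (3, 3) pinhole intrinsics.
    image_hw:   (height, width) of the image in pixels.
    """
    h, w = image_hw
    visible = np.full_like(labels, IGNORE)

    # Indices of occupied voxels and their centers in the camera frame.
    occ = np.argwhere((labels != EMPTY) & (labels != IGNORE))
    if occ.size == 0:
        return visible
    centers = origin + (occ + 0.5) * voxel_size

    # Discard voxels behind the camera.
    z = centers[:, 2]
    front = z > 1e-6
    occ, centers, z = occ[front], centers[front], z[front]

    # Project voxel centers to integer pixel coordinates.
    uvw = centers @ K.T
    u = np.floor(uvw[:, 0] / z).astype(int)
    v = np.floor(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    occ, z, u, v = occ[inside], z[inside], u[inside], v[inside]

    # Z-buffer: record the nearest depth that lands on each pixel.
    pix = v * w + u
    zbuf = np.full(h * w, np.inf)
    np.minimum.at(zbuf, pix, z)

    # Keep voxels within half a voxel of the nearest depth at their pixel.
    keep = z <= zbuf[pix] + 0.5 * voxel_size
    idx = occ[keep]
    visible[idx[:, 0], idx[:, 1], idx[:, 2]] = labels[idx[:, 0], idx[:, 1], idx[:, 2]]
    return visible
```

Because such a mask depends only on the ground-truth grid and the camera calibration, it can be computed once offline and stored next to the dense labels, which matches the offline character of the strategy described above.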

Paper Structure

This paper contains 17 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the proposed VOIC framework. Unlike conventional semantic scene completion (SSC) methods that directly supervise with full ground-truth labels, VOIC adopts an explicitly decoupled progressive process. 2D features are first lifted to 3D via cross-modal feature fusion. The Visible Decoder (VD) processes observed regions, and the Occlusion Decoder (OD) leverages normalized VD features as priors to reconstruct the complete 3D scene.
  • Figure 2: Overall architecture of the VOIC framework. (a) The model follows a progressive visible–occluded paradigm that decouples the monocular 3D Semantic Scene Completion (SSC) task. The Visible Embedding Feature Constructor (VEFC) lifts 2D image features into 3D and fuses them with depth-derived occupancy to form a unified volumetric representation. (b) The Visible Decoder (VD) predicts the geometry and semantics of the observed regions under explicit visible-region voxel supervision. (c) The Occlusion Decoder (OD) takes the normalized visible features from VD as spatial–semantic priors and completes the full 3D scene structure.
  • Figure 3: Sparse Voxel Feature Initialization. The VEFC module creates a geometry-driven 3D representation using a zero-initialized query. An occupancy mask sparsifies the structure, while 3D positional encoding enables precise localization. Deformable Attention then aggregates 2D image features into the visible voxels.
  • Figure 4: Qualitative results on the SemanticKITTI validation set. VOIC enhances overall scene classification through high-quality visible-range semantic priors predicted by VD. Shown here is a comparison between VOIC and failure cases of Symphonize (Jiang et al., 2024).
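
The first three steps named in the Figure 3 caption (depth-derived occupancy mask, zero-initialized queries, 3D positional encoding) can be sketched as below. This is an illustrative approximation under stated assumptions, not the VEFC implementation: the helper names `depth_to_occupancy`, `sinusoidal_pe_3d`, and `init_sparse_queries` are hypothetical, the sinusoidal form of the encoding is an assumption, and the Deformable Attention step that aggregates 2D image features is omitted entirely.

```python
import numpy as np


def depth_to_occupancy(depth, K, grid_shape, voxel_size, origin):
    """Back-project a depth map (H, W) into a boolean voxel occupancy grid."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.ravel()
    valid = z > 0  # assumed convention: non-positive depth means no measurement

    # Pixel centers to camera-frame points with a pinhole model.
    x = (u.ravel() + 0.5 - K[0, 2]) / K[0, 0] * z
    y = (v.ravel() + 0.5 - K[1, 2]) / K[1, 1] * z
    pts = np.stack([x, y, z], axis=1)[valid]

    # Points to voxel indices; drop those outside the grid.
    idx = np.floor((pts - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occ = np.zeros(grid_shape, dtype=bool)
    occ[tuple(idx[inside].T)] = True
    return occ


def sinusoidal_pe_3d(idx, dim):
    """Encode integer voxel indices (N, 3) as (N, dim); dim must be divisible by 6."""
    per_axis = dim // 3
    freqs = 1.0 / (10000.0 ** (np.arange(0, per_axis, 2) / per_axis))
    ang = idx[:, :, None] * freqs[None, None, :]              # (N, 3, per_axis / 2)
    pe = np.concatenate([np.sin(ang), np.cos(ang)], axis=2)   # (N, 3, per_axis)
    return pe.reshape(len(idx), dim)


def init_sparse_queries(occ, dim=12):
    """Zero-initialized queries at occupied voxels plus their positional encoding."""
    idx = np.argwhere(occ)                  # (N, 3) indices of occupied voxels
    queries = np.zeros((len(idx), dim))     # zero-initialized query features
    return idx, queries + sinusoidal_pe_3d(idx, dim)
```

Storing only the `N` occupied voxels rather than the full grid is what keeps the representation sparse; a later attention step would then update these `N` query vectors using image features.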