Skip Mamba Diffusion for Monocular 3D Semantic Scene Completion

Li Liang, Naveed Akhtar, Jordan Vice, Xiangrui Kong, Ajmal Saeed Mian

TL;DR

This work tackles monocular 3D semantic scene completion by running diffusion denoising in the latent space of a variational autoencoder (VAE), aiming for efficient, high-fidelity 3D reconstruction from a single image. The core contribution is the Skimba denoising diffusion network, built on a Skip Triple Mamba structure with varying dilations across the forward, reverse, and spatial directions, together with a Multi-Scale Convolutional Block (MSCB) that captures multi-scale context. The method performs 3D scene completion followed by semantic segmentation, and is trained with a composite loss that combines diffusion denoising objectives with Lovasz-augmented cross-entropy, all within a VAE-conditioned framework. Experiments on SemanticKITTI and SSCBench-KITTI-360 show state-of-the-art performance among monocular methods and results competitive with stereo methods, which supports the approach's relevance to autonomous navigation; the code is publicly available.

Abstract

3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to challenging real-world conditions, this task usually demands complex models that process multi-modal data to achieve acceptable performance. We propose a unique neural model that leverages advances in state space and diffusion generative modeling to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder, where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long-sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals, and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench-KITTI-360 datasets shows that our approach not only outperforms other monocular techniques by a large margin, but also achieves competitive performance against stereo methods. The code is available at https://github.com/xrkong/skimba
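
To make the phrase "diffusion modeling in the latent space" concrete, the following is a minimal sketch of the standard DDPM forward-noising step and noise-prediction loss applied to a latent tensor. It is generic DDPM, not the authors' Skimba implementation: the linear noise schedule, the latent shape, and the names `linear_beta_schedule`, `q_sample`, and `denoising_loss` are all assumptions made for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear schedule; the paper's actual schedule is not given here.
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(z0, t, alpha_bar, rng):
    # Forward noising: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def denoising_loss(eps_pred, eps):
    # Mean squared error between predicted and true noise.
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)

# Hypothetical VAE latent of a scene, with a made-up shape (D, L', W', H').
z0 = rng.standard_normal((4, 8, 8, 2))
zt, eps = q_sample(z0, 500, alpha_bar, rng)

# A trained denoiser would predict eps from (zt, t, condition);
# here a zero prediction stands in for an untrained network.
loss = denoising_loss(np.zeros_like(eps), eps)
```

In a latent diffusion setup of this kind, the denoiser (here, the role played by Skimba) would replace the zero prediction, and the VAE decoder would map the denoised latent back to the 3D scene.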

Paper Structure

This paper contains 11 sections, 6 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Schematics of the approach. Our method comprises a 3D scene completion network and a 3D semantic segmentation network. The former is encapsulated in a VAE framework that employs two sub-networks for conditioning its latent space: a Multi-Scale Convolutional Block (MSCB) and a Skimba denoising network. The 3D semantic segmentation network employs a variant of Skimba. L, W, and H denote the length, width, and height of the original scene, and D is the feature map dimension.
  • Figure 2: Architectural details of the Skimba denoising network. Refer to the text for details.
  • Figure 3: Qualitative results on the SemanticKITTI validation set. From left to right, the columns show the input data, the ground truth, and the outputs of SkimbaDiff (our method), MonoScene, OccFormer, and VoxFormer-T (a stereo method).