Skip Mamba Diffusion for Monocular 3D Semantic Scene Completion
Li Liang, Naveed Akhtar, Jordan Vice, Xiangrui Kong, Ajmal Saeed Mian
TL;DR
This work tackles monocular 3D semantic scene completion by running diffusion denoising in the latent space of a variational autoencoder (VAE), enabling efficient, high-fidelity 3D reconstruction from a single image. The core innovation is the Skimba denoising diffusion network, built on a Skip Triple Mamba structure with varying dilations across the forward, reverse, and spatial directions, together with a Multi-Scale Convolution Block that captures multi-scale context. The model performs 3D scene completion followed by a semantic segmentation stage, and is trained with a composite loss that combines diffusion denoising objectives with a Lovász-augmented cross-entropy, all within a VAE-conditioned framework. Experiments on SemanticKITTI and SSCBench-KITTI-360 show state-of-the-art performance among monocular methods and results competitive with stereo methods, supporting the approach's applicability to autonomous navigation tasks. Code is publicly available for reproducibility.
Abstract
3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to challenging real-world conditions, this task usually demands complex models that process multi-modal data to achieve acceptable performance. We propose a unique neural model that leverages advances in state space and diffusion generative modeling to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder, where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long-sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals, and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench-KITTI-360 datasets shows that our approach not only outperforms other monocular techniques by a large margin, but also achieves competitive performance against stereo methods. The code is available at https://github.com/xrkong/skimba
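To make the idea of diffusion in a VAE latent space more concrete, the following is a minimal sketch of the standard DDPM-style forward noising step and noise-prediction loss applied to a latent vector. It is not taken from the paper or its repository: the linear beta schedule, the function names (`linear_beta_schedule`, `cumulative_alphas`, `add_noise`, `denoising_loss`), the step count, and the toy latent are all assumptions chosen for illustration, and the Skimba denoiser itself is not modeled here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps


def denoising_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
z0 = [0.5, -1.0, 2.0, 0.0]  # hypothetical stand-in for a flattened VAE latent
zt, eps = add_noise(z0, 500, alpha_bars, rng)

# A perfect denoiser would predict eps exactly, giving zero loss.
print(denoising_loss(eps, eps))  # 0.0
```

In a full system of the kind the abstract describes, a learned denoiser conditioned on image features would replace the perfect prediction used in the last line, and its output would be compared against `eps` in the same way.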
