Global-Aware Monocular Semantic Scene Completion with State Space Models

Shijie Li, Zhongyao Cheng, Rong Li, Shuai Li, Juergen Gall, Xun Xu, Xulei Yang

TL;DR

GA-MonoSSC tackles monocular semantic scene completion by modeling global context in both 2D and 3D. It combines a Transformer-based Dual-Head Multi-Modality Encoder for 2D feature extraction with a Frustum Mamba Decoder that uses State Space Models for efficient long-range reasoning in 3D. A frustum-based voxel reordering strategy arranges voxels to suit Mamba's sequential scan, reducing feature discontinuities in the voxel sequence and improving 3D representation learning. Experiments on Occ-ScanNet and NYUv2 show state-of-the-art results for indoor scene reconstruction from a single image.

Abstract

Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points (Fig. 1) and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the State Space Model (SSM) for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code will be released upon acceptance.
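
To make the idea of a frustum reordering concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera, voxels expressed in camera coordinates with positive depth, and a sort key of (image row, image column, depth); the exact ordering used in the Frustum Mamba Decoder is not specified here. The names `frustum_key` and `frustum_reorder`, as well as all camera parameters, are hypothetical.

```python
from typing import List, Sequence, Tuple

Voxel = Tuple[int, int, int]  # integer grid index (ix, iy, iz)


def frustum_key(
    voxel: Voxel,
    voxel_size: float = 0.1,
    origin: Sequence[float] = (-0.4, -0.4, 0.5),
    focal: float = 100.0,
    center: Tuple[float, float] = (64.0, 64.0),
    cell: float = 8.0,
) -> Tuple[int, int, float]:
    """Map a voxel index to a (row, col, depth) key in frustum space.

    All camera parameters are hypothetical defaults; voxels are assumed to
    lie in front of the camera (depth > 0) in camera coordinates.
    """
    # Voxel center in camera coordinates.
    x, y, z = (origin[a] + (voxel[a] + 0.5) * voxel_size for a in range(3))
    # Pinhole projection onto the image plane.
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    # Bucket pixels into coarse cells so voxels on nearby rays share a bucket.
    return (int(v // cell), int(u // cell), z)


def frustum_reorder(voxels: List[Voxel], **camera) -> List[Voxel]:
    """Sort voxels by image row, then image column, then depth."""
    return sorted(voxels, key=lambda vox: frustum_key(vox, **camera))


# Usage: flatten a small 4x4x4 grid into a frustum-ordered sequence.
grid = [(i, j, k) for i in range(4) for j in range(4) for k in range(4)]
sequence = frustum_reorder(grid)
```

With this key, voxels that project to the same image cell become neighbours in the sequence and are visited front to back, which is one way a sequential scan could see fewer abrupt feature changes than a plain flattening by 3D coordinates.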

Paper Structure

This paper contains 16 sections, 23 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Perspective projection results in an imbalanced distribution of projected points, where nearby points are more widely scattered while distant points are more densely packed. The real projection map is shown on the right.
  • Figure 2: Arranging voxel features into a sequence based on their 3D coordinates introduces discontinuities (right). This issue can be mitigated by sorting voxels within the frustum space, where voxel features are distributed across the image plane and exhibit strong similarity along the depth dimension (left).
  • Figure 3: Overview of the proposed Dual-Head Multi-Modality Encoder (DM$_{\textbf{Enc}}$), which extracts multi-scale geometry and semantic features separately.
  • Figure 4: Architecture of the proposed Frustum Mamba Decoder $\mathbf{FM_{dec}}$. $\mathbf{FM_{dec}}$ adopts an encoder-decoder architecture, with the proposed Frustum Mamba Layer integrated into each stage of the encoder to capture long-range dependencies. The Frustum Reordering Strategy arranges voxels into a sequence within frustum space rather than the original 3D space, mitigating feature discontinuities and improving sequential consistency.
  • Figure 5: Qualitative results on the NYUv2 dataset. The red rectangles indicate incorrect semantic predictions. Compared to previous methods, the proposed MambaSSC produces significantly fewer incorrect predictions.
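
The imbalance described in the Figure 1 caption follows from the pinhole projection itself. As a rough illustration only, the sketch below computes the pixel distance between two laterally adjacent voxel centers at several depths. The function name `pixel_spacing`, the voxel size, and the focal length are assumptions chosen for the example and are not values from the paper.

```python
def pixel_spacing(depth: float, voxel_size: float = 0.08, focal: float = 500.0) -> float:
    """Pixel distance between two laterally adjacent voxel centers at `depth`.

    Under a pinhole model u = focal * x / depth, so a lateral step of
    `voxel_size` maps to focal * voxel_size / depth pixels.
    """
    if depth <= 0:
        raise ValueError("depth must be positive (in front of the camera)")
    return focal * voxel_size / depth


# Hypothetical depths in metres; spacing shrinks as depth grows.
spacings = {depth: pixel_spacing(depth) for depth in (1.0, 2.0, 4.0)}
for depth, spacing in spacings.items():
    print(f"depth {depth:.1f} m -> {spacing:.1f} px between neighbouring voxels")
```

Because the spacing scales with 1/depth, doubling the depth halves the distance between projected neighbours, so distant voxels crowd into few pixels while nearby voxels spread out.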