AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao

Abstract

Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce AdaSFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that enriches spatial information by modeling the relative spatial relationships of patches; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that AdaSFormer achieves state-of-the-art performance. The code is publicly available at: https://github.com/alanWXZ/AdaSFormer.

Paper Structure

This paper contains 14 sections, 14 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: AdaSFormer Framework. (a) Overall network architecture: It mainly consists of a 2D encoder, a depth estimation network, a 2D-to-3D projection module, and a 3D network. (b) Adaptive Serialized Transformer (ASF): This is the core of our method. A set of ASF blocks forms the encoder of the 3D network. The 3D voxels are adaptively converted into 1D patches, which are then augmented with center-relative positional encoding. These patches are processed through standard self-attention and feed-forward layers. Through the proposed Convolution-Modulated Layer Normalization (CMLN), the heterogeneous features from the transformer and convolutional layers are bridged. The resulting features are then fed into a DDR block for scene completion. (c) Convolution-Modulated Layer Normalization (CMLN): Modulates features to integrate transformer and convolutional representations effectively (an illustrative sketch of this modulation follows the figure list). (d) Center-Relative Positional Encoding: Enhances the richness of the input information by modeling the relative spatial relationships of patches.
  • Figure 2: (a) Illustration of the Z-order serialization of a 2D image, as well as that of a sparse 2D image, where the letters “C” and “V” indicate the locations of objects in the spatial domain. (b)-(e) Z-order serialization with shifts of 0, 3, 6, and 9 (from left to right). Different shift settings significantly affect the coverage and overlap between adjacent patches, especially for large patch sizes (a minimal serialization sketch follows the figure list).
  • Figure 3: Visualization of the ablation results on the Occ-ScanNet dataset [ssc_ISO]. Introducing AdaSFormer significantly improves the model’s capability to capture spatial and semantic relationships, leading to more complete and coherent object completion.
  • Figure 4: Qualitative comparisons of semantic scene completion results on the NYUv2 test set [dataset_nyu] with different methods. From left to right: (a) the single RGB input; (b) the ground truth; (c) ISO [ssc_ISO]; (d) the baseline; and (e) our proposed method.
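
To make the CMLN idea in Figure 1(c) concrete, below is a minimal PyTorch-style sketch. It reads CMLN as a FiLM-style modulation in which the affine parameters of layer normalization are predicted from the parallel convolutional features; the class and argument names here are hypothetical, and this is an assumption-based illustration rather than the authors' released implementation (see the GitHub repository for the official code).

```python
import torch
import torch.nn as nn


class CMLNSketch(nn.Module):
    """Illustrative Convolution-Modulated Layer Normalization (an assumption,
    not the official AdaSFormer code): normalize transformer tokens, then
    re-scale and shift them with parameters predicted from the matching
    convolutional features, so the two heterogeneous representations are
    expressed on a shared statistical footing."""

    def __init__(self, dim: int):
        super().__init__()
        # Plain LayerNorm without its own affine parameters; the affine
        # part comes from the convolutional branch instead.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Predict per-channel scale (gamma) and shift (beta) from the
        # convolutional feature attached to each serialized voxel token.
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, tokens: torch.Tensor, conv_feat: torch.Tensor) -> torch.Tensor:
        # tokens, conv_feat: (batch, num_tokens, dim)
        gamma, beta = self.to_scale_shift(conv_feat).chunk(2, dim=-1)
        return self.norm(tokens) * (1.0 + gamma) + beta


# Toy usage: 8 serialized voxel tokens with 32 channels each.
x = torch.randn(2, 8, 32)          # transformer tokens
c = torch.randn(2, 8, 32)          # matching convolutional features
print(CMLNSketch(32)(x, c).shape)  # torch.Size([2, 8, 32])
```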
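
The shifted Z-order serialization in Figure 2 can likewise be sketched in a few lines of plain Python. Here the "shift" is interpreted, as an assumption, as a constant offset added to the voxel coordinates before the Morton (Z-order) key is computed; changing the shift moves the curve's patch boundaries, so consecutive 1D patches cover different, partially overlapping 3D neighborhoods. The function names and the toy voxel coordinates are illustrative only.

```python
def part1by2(v: int) -> int:
    """Spread the low 10 bits of v so they occupy every third bit."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v


def z_order_key(x: int, y: int, z: int, shift: int = 0) -> int:
    """Morton (Z-order) key of a voxel whose coordinates are offset by `shift`."""
    return (part1by2(x + shift)
            | (part1by2(y + shift) << 1)
            | (part1by2(z + shift) << 2))


# Occupied voxels of a toy scene; serializing them under two different shifts
# changes the 1D visiting order, and hence how they are grouped into patches.
coords = [(0, 0, 0), (1, 0, 2), (3, 3, 1), (2, 1, 0), (7, 2, 5)]
for s in (0, 3):
    order = sorted(range(len(coords)), key=lambda i: z_order_key(*coords[i], shift=s))
    print(f"shift={s}: visiting order {order}")
```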