Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired

Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu, Manping Fan, JingYe Cai, Zhenglin Yang

TL;DR

An Adaptive Multi-scale Attention Aggregation framework built upon the MonoScene pipeline that improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.

Abstract

In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.
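For readers unfamiliar with the two numbers reported above, the following is a minimal sketch of how SC IoU and SSC mIoU are commonly computed on voxel label grids. It is not the paper's evaluation code: the function names, the choice of label 0 for empty space, and label 255 for ignored voxels are assumptions made for illustration.

```python
import numpy as np

def sc_iou(pred, target, empty=0, ignore=255):
    """Scene-completion IoU: binary overlap of occupied vs. empty voxels.

    `empty` and `ignore` label values are assumptions for this sketch.
    """
    valid = target != ignore
    pred_occ = (pred != empty) & valid
    true_occ = (target != empty) & valid
    union = np.logical_or(pred_occ, true_occ).sum()
    inter = np.logical_and(pred_occ, true_occ).sum()
    return inter / union if union else float("nan")

def ssc_miou(pred, target, num_classes, empty=0, ignore=255):
    """Semantic scene completion mIoU: mean per-class IoU over non-empty classes."""
    valid = target != ignore
    ious = []
    for c in range(num_classes):
        if c == empty:
            continue
        p = (pred == c) & valid
        t = (target == c) & valid
        union = np.logical_or(p, t).sum()
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

SC IoU only asks whether a voxel is occupied, while SSC mIoU also requires the semantic class to match, which is why the second number is typically much lower than the first.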

Paper Structure (29 sections, 15 equations, 9 figures, 5 tables)


Figures (9)

  • Figure 1: Overall architecture of the proposed AMAA framework. Multi-scale 2D features extracted from the input RGB image are lifted into a 3D voxel space via the FLoSP projection operator. The lifted voxel features are refined by a parallel channel-spatial aggregation module (SEBlock-3D and SimAM-3D), followed by a 3D decoder equipped with Adaptive Feature Gating (AFG) blocks for regulated multi-scale fusion, producing voxel-wise semantic occupancy predictions over the 3D voxel grid.
  • Figure 2: SEBlock-3D for channel-wise recalibration in voxel space. The module applies global 3D average pooling to summarize channel responses and uses a lightweight bottleneck MLP to generate channel-wise modulation weights, enhancing informative semantic channels in volumetric features.
  • Figure 3: SimAM-3D for parameter-free spatial reweighting in voxel space. The module computes an energy-based importance score for each voxel activation using region-wise activation statistics, producing a spatial attention map without introducing additional learnable parameters.
  • Figure 4: Adaptive Feature Gating (AFG) module for regulated multi-scale fusion. AFG predicts a voxel-wise gating map conditioned on decoder and encoder features, enabling selective injection of fine-scale encoder information into the decoder through gated residual fusion.
  • Figure 5: Category-wise SSC IoU distributions on NYUv2 across methods. The boxplots summarize per-category IoU scores (median, interquartile range, and whiskers) for LMSCNet, AICNet, 3DSketch, MonoScene, and the proposed AMAA. Overall, AMAA exhibits consistently higher medians across most semantic categories, indicating more reliable semantic occupancy completion under monocular constraints.
  • ...and 4 more figures
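The three modules described in the captions for Figures 2 to 4 can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the weight shapes, the reduction ratio implied by `w1` and `w2`, the `lam` regularizer, the use of whole-volume per-channel statistics in the SimAM-style energy, and the single-channel sigmoid gate are all assumptions chosen to match the caption text as closely as possible. How the channel and spatial branches are combined in the parallel aggregation is not stated in this excerpt, so it is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block_3d(x, w1, w2):
    """Channel recalibration for a voxel feature x of shape (C, D, H, W).

    w1: (C // r, C) and w2: (C, C // r) form a hypothetical bottleneck MLP.
    """
    squeezed = x.mean(axis=(1, 2, 3))          # global 3D average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck layer with ReLU
    weights = sigmoid(w2 @ hidden)             # per-channel weights in (0, 1)
    return x * weights[:, None, None, None]

def simam_3d(x, lam=1e-4):
    """Parameter-free spatial reweighting in the style of SimAM, extended to 3D.

    Assumption: statistics are taken per channel over the whole volume.
    """
    n = max(x[0].size - 1, 1)
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=(1, 2, 3), keepdims=True) / n
    inv_energy = d / (4.0 * (var + lam)) + 0.5  # larger = more distinctive voxel
    return x * sigmoid(inv_energy)

def adaptive_feature_gate(dec, enc, w_gate, b_gate=0.0):
    """Gated residual fusion of encoder features into the decoder.

    dec, enc: (C, D, H, W). w_gate: (2C,) acts like a 1x1x1 convolution that
    maps the concatenated features to one gate value per voxel (assumed form).
    """
    stacked = np.concatenate([dec, enc], axis=0)                         # (2C, D, H, W)
    gate = sigmoid(np.tensordot(w_gate, stacked, axes=(0, 0)) + b_gate)  # (D, H, W)
    return dec + gate[None] * enc

# Example usage on random features (shapes are arbitrary).
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 3, 5, 6))
channel_out = se_block_3d(feat, rng.standard_normal((2, 4)), rng.standard_normal((4, 2)))
spatial_out = simam_3d(feat)
fused = adaptive_feature_gate(feat, spatial_out, rng.standard_normal(8))
```

Note that `simam_3d` takes no weight arguments, which mirrors the "parameter-free" property named in the title, whereas the SE-style and gating functions each need learned weights. In the gating sketch, a gate near 0 leaves the decoder feature untouched and a gate near 1 injects the full encoder feature.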