AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation
Siqi Du, Weixi Wang, Renzhong Guo, Ruisheng Wang, Yibin Tian, Shengjun Tang
TL;DR
This paper addresses real-time indoor RGB-D semantic segmentation by balancing accuracy with inference speed. It introduces AsymFormer, an architecture that uses an asymmetric backbone for RGB and Depth, coupled with Local Attention Guided Feature Selection (LAFS) and Cross-Modal Attention (CMA) to efficiently fuse multimodal features, followed by a lightweight MLP-Decoder. The approach achieves competitive mIoU on NYUv2 ($54.1\%$) and SUNRGBD ($49.1\%$) while delivering real-time speeds (65 FPS on RTX 3090, 79 FPS with mixed precision), demonstrating a favorable accuracy-efficiency trade-off. Key contributions include the asymmetric backbone design to reduce redundancy, the learnable LAFS module for spatial feature compression, and CMA for cross-modal self-similarity embedding, enabling robust RGB-D fusion in a lightweight framework. These results suggest strong practical potential for mobile platforms and robotics where real-time perception is critical, with further gains anticipated from self-supervised pre-training.
Abstract
Understanding indoor scenes is crucial for urban studies. Considering the dynamic nature of indoor environments, effective semantic segmentation requires both real-time operation and high accuracy. To address this, we propose AsymFormer, a novel network that improves real-time semantic segmentation accuracy using RGB-D multi-modal information without substantially increasing network complexity. AsymFormer uses an asymmetrical backbone for multimodal feature extraction, reducing redundant parameters by optimizing the distribution of computational resources. To fuse the asymmetric multimodal features, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from different modalities by leveraging their dependencies. Subsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module is introduced to further extract cross-modal representations. AsymFormer achieves competitive results, with 54.1% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, it reaches an inference speed of 65 FPS (79 FPS after implementing mixed precision quantization) on an RTX 3090, demonstrating that it can strike a balance between high accuracy and efficiency.
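The accuracy figures above are reported as mIoU (mean Intersection over Union). As a reading aid, the sketch below shows how this metric is commonly computed from flattened per-pixel labels: for each class, IoU = TP / (TP + FP + FN), and mIoU is the average over classes. The function name `mean_iou`, its parameters, and the toy labels are illustrative assumptions, not taken from the paper, and the authors' evaluation code may differ (for example, benchmark scores are usually accumulated over the whole test set rather than per image).

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: flat sequences of integer class ids, one per pixel.
    ignore_index: optional ground-truth label to skip (hypothetical parameter).
    """
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: not counted
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    # Average only over classes that actually appear.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 3 classes and 6 pixels (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.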
