AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

Siqi Du; Weixi Wang; Renzhong Guo; Ruisheng Wang; Yibin Tian; Shengjun Tang

TL;DR

This paper addresses real-time indoor RGB-D semantic segmentation by balancing accuracy with inference speed. It introduces AsymFormer, an architecture that uses an asymmetric backbone for RGB and Depth, coupled with Local Attention Guided Feature Selection (LAFS) and Cross-Modal Attention (CMA) to efficiently fuse multimodal features, followed by a lightweight MLP-Decoder. The approach achieves competitive mIoU on NYUv2 (54.1%) and SUNRGBD (49.1%) while delivering real-time speeds (65 FPS on RTX 3090, 79 FPS with mixed precision), demonstrating a favorable accuracy-efficiency trade-off. Key contributions include the asymmetric backbone design to reduce redundancy, the learnable LAFS module for spatial feature compression, and CMA for cross-modal self-similarity embedding, enabling robust RGB-D fusion in a lightweight framework. These results suggest strong practical potential for mobile platforms and robotics where real-time perception is critical, with further gains anticipated from self-supervised pre-training.

Abstract

Understanding indoor scenes is crucial for urban studies. Considering the dynamic nature of indoor environments, effective semantic segmentation requires both real-time operation and high accuracy. To address this, we propose AsymFormer, a novel network that improves real-time semantic segmentation accuracy using RGB-D multi-modal information without substantially increasing network complexity. AsymFormer uses an asymmetrical backbone for multimodal feature extraction, reducing redundant parameters by optimizing the distribution of computational resources. To fuse the asymmetric multimodal features, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from different modalities by leveraging their dependencies. Subsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module further extracts cross-modal representations. AsymFormer demonstrates competitive results, with 54.1% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer achieves an inference speed of 65 FPS (79 FPS after implementing mixed precision quantization) on an RTX 3090, showing that it can strike a balance between high accuracy and efficiency.
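The accuracy figures above are reported as mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how mIoU is commonly computed from a confusion matrix. It is not the authors' evaluation code: the function name `mean_iou`, the flat label lists, the `ignore_index` value of 255, and the choice to skip classes absent from both prediction and ground truth are all assumptions made for illustration.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = 255) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth.

    Assumption: labels are flat sequences of class ids, and pixels whose
    ground-truth label equals `ignore_index` are excluded from scoring.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    # confusion[t][p] counts pixels with ground-truth class t predicted as p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        confusion[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fp = sum(confusion[t][c] for t in range(num_classes)) - tp
        fn = sum(confusion[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both pred and target
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Benchmark protocols may differ in details such as how void labels and absent classes are handled, so published numbers should be compared only under the same protocol.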

Paper Structure (23 sections, 5 equations, 7 figures, 3 tables)

Introduction
Related Works
Indoor Scene Understanding
RGB-D Representation Learning
RGB-D Feature Fusion
Method
Framework Overview
Local Attention Guided Feature Selection
Cross-Attention Guided Feature Embedding
Definition of Cross-Modal Self-Similarity
Feature Embedding
Splitting and Mixing of Multimodal Information
Representation Learning in Multiple Subspaces
Experiment Results
Implementation Details
...and 8 more sections

Figures (7)

Figure 1: The AsymFormer has 33.0 million parameters and 36.0 GFLOPs computational cost, and it can achieve 65 FPS inference speed on RTX 3090, 54.1% mIoU on NYUv2.
Figure 2: Overview of AsymFormer.
Figure 3: LAFS.
Figure 4: Feature Embedding.
Figure 5: Splitting and Mixing of Multimodal Information.
...and 2 more figures
