DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation

Yan Gong, Jianli Lu, Yongsheng Gao, Jie Zhao, Xiaojuan Zhang, Susanto Rahardja

TL;DR

DiffPixelFormer tackles RGB-D indoor scene segmentation by combining intra-modal self-attention with a Differential-Shared Inter-Modal (DSIM) module to achieve pixel-level cross-modal alignment. The method introduces a lightweight, pixel-aware cross-attention mechanism that separates modality-specific from shared cues, reducing computation relative to standard cross-attention, and a dynamic fusion strategy that balances the contribution of each modality. Results on SUN RGB-D and NYUDv2 show state-of-the-art mIoU (54.28% and 59.95% for DiffPixelFormer-L) while maintaining real-time speed (~41.66 FPS) and lower parameter counts. The work advances robust RGB-D fusion for indoor perception, with potential extensions to broader multimodal tasks and missing-modality scenarios.

Abstract

Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues, enabling fine-grained, pixel-level cross-modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB-D information according to scene characteristics. Extensive experiments on the SUN RGB-D and NYUDv2 benchmarks demonstrate that DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer-L by 1.78 and 2.75 percentage points, respectively. Code is available at https://github.com/gongyan1/DiffPixelFormer.
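
The results above are reported as mIoU (mean intersection-over-union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. It is not taken from the DiffPixelFormer repository: the function name `mean_iou`, the `ignore_index=255` convention for unlabeled pixels, and the choice to skip classes that appear in neither the prediction nor the ground truth are all assumptions made for illustration. Benchmark evaluation scripts usually accumulate counts over the whole test set before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over flattened per-pixel class labels.

    pred, target: equal-length sequences of integer class ids.
    Pixels whose target equals ignore_index (an assumed convention)
    are excluded from every count.
    """
    intersection = [0] * num_classes   # pixels where pred == target == c
    pred_count = [0] * num_classes     # pixels predicted as class c
    target_count = [0] * num_classes   # pixels labeled as class c

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:                  # skip classes absent from both maps
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, last pixel unlabeled.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. A gain of a point or two in mIoU, as reported against DFormer-L, is averaged over all classes and therefore reflects improvement across categories rather than on a single dominant one.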

Paper Structure

This paper contains 23 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of receptive fields among different cross-attention-based multimodal fusion methods.
  • Figure 2: Performance comparison of different attention variants on SUN RGB-D and NYUDv2 in terms of parameters and mIoU. “CA” denotes Cross-Attention, “SWA” denotes Shifted Window Attention, “LCA” denotes Local Cross-Attention, and “PACA” denotes Pixel-Aware Cross-Attention.
  • Figure 3: The overall architecture of DiffPixelFormer adopts an encoder–decoder design, where the encoder employs multiple Intra-Inter Modal Interaction Blocks (IIMIBs) for efficient intra- and inter-modal fusion, and the decoder restores spatial and semantic details via multi-scale aggregation.
  • Figure 4: Qualitative comparison of our DiffPixelFormer with the baseline and various cross-attention methods on NYUDv2, where GT denotes the ground truth.
  • Figure 5: Qualitative comparison of our DiffPixelFormer with the baseline and various cross-attention methods on SUN RGB-D.