HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework

Shuobin Wei; Zhuang Zhou; Zhengan Lu; Zizhao Yuan; Binghua Su

TL;DR

HDBFormer addresses RGB-D indoor semantic segmentation by treating the RGB and depth modalities differently within a heterogeneous dual-branch architecture. The RGB branch uses two encoders (a basic encoder and a detail encoder), while the depth branch uses LDFormer, a lightweight hierarchical encoder. The two branches are linked by the Modality Information Interaction Module (MIIM), which fuses global and local information across modalities. Experiments on NYUDepthv2 and SUN-RGBD show state-of-the-art performance, supporting both the lightweight depth encoder and the targeted fusion strategy. The contributions (LDFormer, MIIM, and graded feature processing with iterative fusion) offer a framework for efficient multimodal fusion in complex indoor scenes and may extend to other cross-modal applications.

Abstract

In RGB-D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images. However, most existing methods overlook the inherent differences in how RGB and depth images express information. Processing the two modalities differently is essential to fully exploit their distinct characteristics. To address this, we propose a novel heterogeneous dual-branch framework called HDBFormer, specifically designed to handle these modality differences. For RGB images, which contain rich detail, we employ both a basic encoder and a detail encoder to extract local and global features. For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters. Additionally, we introduce the Modality Information Interaction Module (MIIM), which combines transformers with large kernel convolutions to enable efficient interaction of global and local information across modalities. Extensive experiments show that HDBFormer achieves state-of-the-art performance on the NYUDepthv2 and SUN-RGBD datasets. The code is available at: https://github.com/Weishuobin/HDBFormer.
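
To make the idea of combining a global, attention-style term with a local, large-kernel term more concrete, the following is a minimal toy sketch in plain Python. It is not the authors' MIIM: the function names (`global_cross_attention`, `local_large_kernel`, `fuse`), the use of scalar features on a 1-D sequence, the uniform kernel, and the additive fusion are all assumptions made purely for illustration. The actual implementation is in the repository linked above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_cross_attention(rgb, depth):
    """Global term (assumed form): each RGB token attends to every depth token.

    Features are scalars here, so the query-key score is a plain product.
    """
    attended = []
    for query in rgb:
        weights = softmax([query * key for key in depth])
        attended.append(sum(w * v for w, v in zip(weights, depth)))
    return attended


def local_large_kernel(depth, kernel_size=7):
    """Local term (assumed form): uniform 1-D convolution with zero padding.

    A stand-in for a learned large-kernel convolution.
    """
    half = kernel_size // 2
    n = len(depth)
    smoothed = []
    for i in range(n):
        window = [depth[j] if 0 <= j < n else 0.0
                  for j in range(i - half, i + half + 1)]
        smoothed.append(sum(window) / kernel_size)
    return smoothed


def fuse(rgb, depth, kernel_size=7):
    """Hypothetical fusion: RGB feature plus the global and local depth terms."""
    global_term = global_cross_attention(rgb, depth)
    local_term = local_large_kernel(depth, kernel_size)
    return [r + g + l for r, g, l in zip(rgb, global_term, local_term)]


if __name__ == "__main__":
    rgb_tokens = [0.2, -0.1, 0.4, 0.0]
    depth_tokens = [1.0, 0.5, 0.5, 1.0]
    print(fuse(rgb_tokens, depth_tokens, kernel_size=3))
```

In this sketch the attention term lets every position see the whole depth sequence, while the convolution term only sees a neighborhood of `kernel_size` positions; a real module would operate on multi-channel 2-D feature maps with learned weights.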
