DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, Qibin Hou

TL;DR

DFormer addresses RGB-D segmentation by pretraining its backbone on image-depth pairs from ImageNet-1K, so the encoder learns RGB-D representations directly instead of reusing RGB-pretrained backbones to encode depth. It introduces a hierarchical RGB-D encoder whose Global Awareness Attention and Local Enhancement Attention blocks fuse RGB and depth features during pretraining, paired with a lightweight decoder that uses only the encoder's RGB features at finetuning. The approach achieves state-of-the-art results on two RGB-D semantic segmentation datasets (NYUDepthv2 and SUN-RGBD) and on five RGB-D salient object detection datasets, at less than half the computational cost of the previous best methods. The resulting representations are transferable across RGB-D tasks, with the two modalities interacting at every stage of the encoder.

Abstract

We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two key innovations: 1) Unlike previous works that encode RGB-D information with RGB-pretrained backbones, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB-pretrained backbones, a problem that is widespread in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that DFormer achieves new state-of-the-art performance on these two tasks, covering two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets, with less than half of the computational cost of the current best methods. Our code is available at: https://github.com/VCIP-RGBD/DFormer.

Paper Structure

This paper contains 21 sections, 4 equations, 15 figures, 18 tables.

Figures (15)

  • Figure 1: Comparisons between the existing popular training pipeline and ours for RGB-D segmentation. RGB pretraining: Recent mainstream methods adopt two RGB pretrained backbones to separately encode RGB and depth information and fuse them at each stage. RGB-D pretraining: The RGB-D backbone in DFormer learns transferable RGB-D representations during pretraining and then is finetuned for segmentation.
  • Figure 2: Performance vs. computational cost on the NYUDepthv2 dataset (Silberman et al., 2012). DFormer achieves the state-of-the-art 57.2% mIoU and the best trade-off compared to other methods.
  • Figure 3: Overall architecture of the proposed DFormer. First, we use the pretrained DFormer to encode the RGB-D data. Then, the features from the last three stages are concatenated and delivered to a lightweight decoder head for final prediction. Note that only the RGB features from the encoder are used in the decoder.
  • Figure 4: Diagrammatic details on how to conduct interactions between RGB and depth features.
  • Figure 5: Visualizations of the feature maps around the last RGB-D block of the first stage.
  • ...and 10 more figures
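
The Figure 3 caption says that the RGB features from the last three encoder stages are concatenated before the decoder head. The following is a minimal, shape-level sketch of that step in plain Python, not the authors' implementation. It assumes that each stage halves the spatial resolution, that the coarser maps are resized to the finest of the three resolutions before concatenation, and that the resizing is nearest-neighbour. All names (`upsample_nearest`, `fuse_last_three_stages`, `make_stage`) and the toy channel counts are hypothetical.

```python
from typing import List

Channel = List[List[float]]   # one H x W feature plane
FeatureMap = List[Channel]    # C planes, i.e. shape (C, H, W)


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling of every channel by an integer factor."""
    return [
        [
            [row[x // factor] for x in range(len(row) * factor)]
            for row in plane
            for _ in range(factor)  # repeat each source row `factor` times
        ]
        for plane in fmap
    ]


def fuse_last_three_stages(stages: List[FeatureMap]) -> FeatureMap:
    """Resize the last three stage outputs to a common size and concatenate
    them along the channel axis (assumed fusion scheme, see lead-in)."""
    last_three = stages[-3:]
    target_h = len(last_three[0][0])  # height of the finest of the three maps
    fused: FeatureMap = []
    for fmap in last_three:
        factor = target_h // len(fmap[0])
        fused.extend(upsample_nearest(fmap, factor) if factor > 1 else fmap)
    return fused


def make_stage(channels: int, size: int, value: float) -> FeatureMap:
    """Toy stand-in for one encoder stage output, filled with a constant."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# Hypothetical RGB features from a four-stage encoder (toy sizes and widths).
rgb_stages = [
    make_stage(2, 8, 1.0),  # stage 1: not passed to the decoder
    make_stage(3, 4, 2.0),  # stage 2
    make_stage(4, 2, 3.0),  # stage 3
    make_stage(5, 1, 4.0),  # stage 4
]
fused = fuse_last_three_stages(rgb_stages)
shape = (len(fused), len(fused[0]), len(fused[0][0]))
print(shape)
```

With these toy inputs the fused map has 3 + 4 + 5 = 12 channels at the stage-2 resolution. A real decoder would typically use a tensor library and learned layers after the concatenation; the nested lists here only make the bookkeeping explicit.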