A Contrastive Fewshot RGBD Traversability Segmentation Framework for Indoor Robotic Navigation

Qiyuan An, Tuan Dang, Fillia Makedon

TL;DR

A multi-modal few-shot segmentation framework that combines RGB images with sparse 1D laser depth information to capture geometric interactions and improve the detection of challenging obstacles. The results highlight the effectiveness of negative prototypes and sparse depth for robust and efficient traversability segmentation.

Abstract

Indoor traversability segmentation aims to identify safe, navigable free space for autonomous agents, which is critical for robotic navigation. Pure vision-based models often fail to detect thin obstacles, such as chair legs, which can pose serious safety risks. We propose a multi-modal segmentation framework that leverages RGB images and sparse 1D laser depth information to capture geometric interactions and improve the detection of challenging obstacles. To reduce the reliance on large labeled datasets, we adopt the few-shot segmentation (FSS) paradigm, enabling the model to generalize from limited annotated examples. Traditional FSS methods focus solely on positive prototypes, often leading to overfitting to the support set and poor generalization. To address this, we introduce a negative contrastive learning (NCL) branch that leverages negative prototypes (obstacles) to refine free-space predictions. Additionally, we design a two-stage attention depth module to align 1D depth vectors with RGB images both horizontally and vertically. Extensive experiments on our custom-collected indoor RGB-D traversability dataset demonstrate that our method outperforms state-of-the-art FSS and RGB-D segmentation baselines, achieving up to 9% higher mIoU under both 1-shot and 5-shot settings. These results highlight the effectiveness of leveraging negative prototypes and sparse depth for robust and efficient traversability segmentation.
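
The mIoU figure quoted above can be made concrete with a small sketch. The abstract does not state the exact evaluation protocol, so this assumes a binary labeling (1 = free space, 0 = obstacle) and averages IoU over those two classes; the function names `iou` and `mean_iou` are illustrative and not taken from the paper.

```python
def iou(pred, target, cls):
    """Intersection over union for one class; None if the class never occurs."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_hit = p == cls
            t_hit = t == cls
            inter += int(p_hit and t_hit)
            union += int(p_hit or t_hit)
    return inter / union if union else None


def mean_iou(pred, target, classes=(0, 1)):
    """Average IoU over the classes present in the prediction or the target."""
    scores = [iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    if not scores:
        raise ValueError("none of the requested classes occur in the masks")
    return sum(scores) / len(scores)


# Toy 2x3 masks: the prediction marks one obstacle pixel as free space.
target = [[1, 1, 0],
          [1, 0, 0]]
pred = [[1, 1, 1],
        [1, 0, 0]]

free_space_iou = iou(pred, target, 1)  # 3 shared pixels out of 4 in the union
obstacle_iou = iou(pred, target, 0)    # 2 shared pixels out of 3 in the union
score = mean_iou(pred, target)
```

In this toy case a single mislabeled pixel lowers both class IoUs, which is why missed thin obstacles show up in the averaged score.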

Paper Structure (19 sections, 6 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Illustration of our RGB-D few-shot segmentation framework. The support and query inputs consist of RGB images ($S_{RGB}$, $Q_{RGB}$) and depth vectors ($S_{D}$, $Q_{D}$), encoded separately and fused via a multi-modal fusion block to produce support and query features ($s$, $q$). The support mask $S_{mask}$ is used for masked pooling over $s$, yielding both positive and negative prototypes ($s^+$, $s^-$). The query feature $q$ is then refined into free-space and obstacle representations ($q^+$, $q^-$), which are concatenated and passed to a lightweight decoder to generate the final query mask ($q_{mask}$).
  • Figure 2: Proposed contrastive few-shot RGB-D segmentation framework. RGB and depth inputs are embedded with modality-specific backbones, fused into unified support and query features, and refined through prototype-based contrastive learning. A lightweight decoder then predicts the query segmentation mask for indoor free space.
  • Figure 3: Two-stage attention depth backbone. It transforms 1D depth vectors into spatially aligned embeddings by applying horizontal attention (beam alignment) followed by vertical attention (height projection), producing refined depth features for multi-modal fusion.
  • Figure 4: Summit-XL Steel platform
  • Figure 5: Qualitative results on indoor traversability segmentation. Each row shows (1) the query RGB image, (2) the corresponding depth vector, (3) predictions without the two-stage depth attention module and without NCL, (4) predictions with the depth module but without NCL, and (5) predictions from the full model. The proposed depth module helps separate floors from walls/ceilings, while the NCL branch further improves recognition of thin obstacles (e.g., chair legs).
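
The masked pooling step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are plain nested lists, the helper names (`masked_average`, `cosine`, `nearest_prototype_mask`) are hypothetical, and a simple nearest-prototype cosine comparison stands in for the paper's NCL branch and decoder, whose details are not given here.

```python
import math


def masked_average(features, mask, keep):
    """Average the feature vectors of all pixels whose mask value equals `keep`.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m == keep:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # class absent from the support mask: zero prototype
    return [t / count for t in total]


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)


def nearest_prototype_mask(query, pos_proto, neg_proto):
    """Mark a query pixel 1 (free space) when it is closer to the positive prototype.

    Assumption: this comparison replaces the learned refinement and decoder.
    """
    return [
        [1 if cosine(v, pos_proto) > cosine(v, neg_proto) else 0 for v in row]
        for row in query
    ]


# Toy 2x2 support feature map with 2 channels; the top row is free space.
support = [[[1.0, 0.0], [0.9, 0.1]],
           [[0.0, 1.0], [0.1, 0.9]]]
support_mask = [[1, 1],
                [0, 0]]

s_pos = masked_average(support, support_mask, keep=1)  # free-space prototype
s_neg = masked_average(support, support_mask, keep=0)  # obstacle prototype

# One-row query: first pixel resembles free space, second resembles an obstacle.
query = [[[0.8, 0.2], [0.2, 0.8]]]
query_mask = nearest_prototype_mask(query, s_pos, s_neg)
```

Keeping an explicit obstacle prototype gives each query pixel something to be compared against besides free space, which is the intuition behind using negative prototypes in addition to positive ones.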