GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Shuaihang Yuan, Hao Huang, Yu Hao, Congcong Wen, Anthony Tzes, Yi Fang

TL;DR

GAMap improves Success Rate (SR) and Success weighted by Path Length (SPL) in zero-shot object goal navigation by guiding the agent with geometric parts and affordance attributes. It requires no additional object-specific training and no fine-tuning on the semantics of unseen objects or the locomotion of the robot.

Abstract

Zero-Shot Object Goal Navigation (ZS-OGN) enables robots or agents to navigate toward objects of unseen categories without object-specific training. Traditional approaches often leverage categorical semantic information for navigation guidance, and they struggle when objects are only partially observed or when detailed and functional representations of the environment are lacking. To resolve these two issues, we propose Geometric-part and Affordance Maps (GAMap), a novel method that integrates object parts and affordance attributes as navigation guidance. Our method includes a multi-scale scoring approach to capture geometric-part and affordance attributes of objects at different scales. Comprehensive experiments conducted on the HM3D and Gibson benchmark datasets demonstrate improvements in Success Rate (SR) and Success weighted by Path Length (SPL), underscoring the efficacy of our geometric-part and affordance-guided navigation approach in enhancing robot autonomy and versatility, without any additional object-specific training or fine-tuning on the semantics of unseen objects and/or the locomotion of the robot.
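
For readers unfamiliar with the two reported metrics, the sketch below computes them in the way they are commonly defined for embodied navigation: Success Rate is the fraction of successful episodes, and Success weighted by Path Length scales each success by the ratio of the shortest-path length to the length of the path actually taken. The function and argument names are illustrative assumptions, not code from the paper, and the paper's exact evaluation protocol may differ in details such as the success threshold.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Assumes every shortest-path length is strictly positive.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            # A successful episode earns at most 1.0, less if the path was longer
            # than the shortest possible one. Failed episodes earn 0.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Hypothetical episodes: one inefficient success, one failure, one optimal success.
example_successes = [True, False, True]
example_shortest = [5.0, 4.0, 2.0]
example_paths = [10.0, 4.0, 2.0]
example_sr = success_rate(example_successes)
example_spl = spl(example_successes, example_shortest, example_paths)
```

With these made-up episodes, SR is 2/3 and SPL is 0.5: the first success contributes only 0.5 because its path was twice the shortest length.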

Paper Structure

This paper contains 26 sections, 7 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: The leftmost RGB image shows the same observation for both methods. Our method (top row) effectively identifies the geometric part of the chair back, which is missed by the traditional method (bottom row). Consequently, GAMap successfully guides the agent to the target object, while the traditional method fails. The red circles highlight the areas where the chair is located and the GA score is high, indicating the effectiveness of our approach in localizing relevant regions.
  • Figure 2: Pipeline of the GAMap generation. Geometric parts and affordance attributes are generated by an LLM. The RGB observation is partitioned into multiple scales, and a CLIP visual encoder generates multi-scale visual embeddings. GA scores are computed using cosine similarity between attribute text embeddings from a CLIP text encoder and the multi-scale visual embeddings. These scores are averaged and projected onto a 2D grid to form the GAMap.
  • Figure 3: Heatmap showing the percentage increase or decrease in SR, together with the time cost, for varying values of $N_a$ and $N_g$. Darker colors indicate a greater decrease in SR, and red solid and dashed lines represent the associated time cost.
  • Figure 4: Changes in SR, SPL, and processing time across different scaling levels on the mini-validation split of HM3D. Increasing the number of scales improves SR and SPL but also increases processing time.
  • Figure 5: Comparison of GA score visualization between gradient-based and patch-based methods for the armrest, backrest, and seat attributes of a target chair. The gradient-based method (top row) often attends to irrelevant areas, such as the ceiling, while the patch-based method (bottom row) accurately focuses on the relevant areas.
  • ...and 3 more figures
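
The scoring step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `encode_patch` is a hypothetical stand-in for a CLIP visual encoder applied to an image crop, `attribute_embeddings` stands in for CLIP text embeddings of the LLM-generated attributes, each scale is assumed to be a regular grid of non-overlapping patches, and scores are assumed to be averaged over both attributes and scales. The projection of scores onto the 2D map is omitted.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right) in pixels


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def partition(height: int, width: int, splits: int) -> List[Box]:
    """Split an image into a splits x splits grid of non-overlapping boxes."""
    rows = [i * height // splits for i in range(splits + 1)]
    cols = [j * width // splits for j in range(splits + 1)]
    return [
        (rows[i], cols[j], rows[i + 1], cols[j + 1])
        for i in range(splits)
        for j in range(splits)
    ]


def ga_score_map(
    height: int,
    width: int,
    encode_patch: Callable[[Box], Vector],  # hypothetical visual encoder
    attribute_embeddings: Sequence[Vector],  # hypothetical text embeddings
    scales: Sequence[int] = (1, 2, 4),
) -> List[List[float]]:
    """Per-pixel GA score, averaged over attributes and over scales."""
    total = [[0.0] * width for _ in range(height)]
    for splits in scales:
        for box in partition(height, width, splits):
            patch_embedding = encode_patch(box)
            # Assumption: a patch's score is its mean similarity to all attributes.
            score = sum(
                cosine(patch_embedding, text) for text in attribute_embeddings
            ) / len(attribute_embeddings)
            top, left, bottom, right = box
            for r in range(top, bottom):
                for c in range(left, right):
                    total[r][c] += score
    return [[value / len(scales) for value in row] for row in total]


def toy_encoder(box: Box) -> Vector:
    """Toy encoder: patches lying fully in the left half 'match' the attribute."""
    _, left, _, right = box
    return [1.0, 0.0] if left == 0 and right <= 2 else [0.0, 1.0]


toy_map = ga_score_map(4, 4, toy_encoder, [[1.0, 0.0]], scales=(1, 2))
```

In the toy example, the whole-image patch does not match the attribute but the two left-hand patches at the finer scale do, so the left half of `toy_map` receives a higher score than the right half. This mirrors the motivation for scoring at several scales: a part that is lost in a global embedding can still be picked up in a smaller patch.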