Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Halil Utku Unlu, Shuaihang Yuan, Congcong Wen, Hao Huang, Anthony Tzes, Yi Fang

TL;DR

This work improves semantic understanding for zero-shot object goal navigation (ZS-OGN), and with it the autonomy of robots in unfamiliar environments, through a dual-component framework: a GLIP Vision Language Model performs the initial detection and an InstructBLIP model validates it.

Abstract

We introduce an innovative approach to advancing semantic understanding in zero-shot object goal navigation (ZS-OGN), enhancing the autonomy of robots in unfamiliar environments. Reliance on labeled data has traditionally limited robotic adaptability. We address this with a dual-component framework that integrates a GLIP Vision Language Model for initial detection and an InstructBLIP model for validation. This combination not only refines object and environmental recognition but also strengthens the semantic interpretation that is pivotal to navigational decision-making. Our method, tested in both simulated and real-world settings, exhibits marked improvements in navigation precision and reliability.
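The detect-then-validate idea can be illustrated with a minimal sketch. It is not the authors' implementation: `Detection`, `confirm_detections`, the `detect` and `verify` callables, and the `min_score` threshold are all hypothetical names introduced here, standing in for a GLIP-style detector and an InstructBLIP-style verifier whose real interfaces are not shown in this summary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple


@dataclass
class Detection:
    """One candidate proposed by the detector (hypothetical structure)."""
    label: str
    score: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def confirm_detections(
    image: Any,
    target: str,
    detect: Callable[[Any, str], Sequence[Detection]],
    verify: Callable[[Any, Detection], bool],
    min_score: float = 0.5,  # assumed threshold, not taken from the paper
) -> Optional[Detection]:
    """Return the best-scoring detection of `target` that the verifier accepts.

    `detect` plays the role of the first-stage detector and `verify` the role
    of the second-stage validator ("Agree" -> True, "Disagree" -> False).
    """
    candidates = [
        d for d in detect(image, target)
        if d.label == target and d.score >= min_score
    ]
    # Try the most confident proposals first; stop at the first agreement.
    for det in sorted(candidates, key=lambda d: d.score, reverse=True):
        if verify(image, det):
            return det
    # No proposal was confirmed; the caller would keep exploring.
    return None
```

Under these assumptions, a navigation loop would commit to a goal only when `confirm_detections` returns a detection, and would otherwise continue with its current plan.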

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the key component of our method for ZS-OGN. The process begins with the GLIP Vision Language Model detecting the target object, in this case an espresso machine. The InstructBLIP model then evaluates the detection, either confirming GLIP's proposal ('Agree') or rejecting it ('Disagree'), which determines whether the navigational plan continues or is adjusted.
  • Figure 2: The flow of our proposed method in a real-world scenario.
  • Figure 3: Layout of the apartment used in the experiment, along with the first-person views of the detected mug (red), remote (green), and trash can (blue). The starting location is marked with a star.
  • Figure 4: A photo of the Unitree B1 robotic platform.
  • Figure 5: Sample path generated from the medial axis algorithm. The path (light gray) connects the current (magenta) and target (dark blue) coordinates through the medial axis (dark gray), entirely within the traversable region (green) and maximizing distance from obstacles (red). Black denotes lethal cost.
  • ...and 2 more figures