Zero-Shot Object Goal Visual Navigation With Class-Independent Relationship Network

Xinting Li, Shiguang Zhang, Yue LU, Kerry Dang, Lingyan Ran

TL;DR

The paper addresses zero-shot object goal visual navigation, where targets are unseen during training. It proposes Class-Independent Relationship Network (CIRN), which builds a target-agnostic state from object detections and semantic similarity to the target, then uses a Graph Convolutional Network to model object relationships followed by an LSTM with an Actor-Critic policy for action selection. A key contribution is decoupling navigation capability from target features, enabling robust zero-shot generalization across cross-target and cross-scene settings in AI2-THOR. Results show CIRN outperforms state-of-the-art zero-shot baselines, with code available for reproduction. This work advances practical embodied AI by reducing reliance on target-specific training data and improving generalization to unseen objects.

Abstract

This paper investigates the zero-shot object goal visual navigation problem. In the object goal visual navigation task, the agent needs to locate navigation targets from its egocentric visual input. "Zero-shot" means that the target the agent needs to find does not appear during the training phase. To address the issue of coupling navigation ability with target features during training, we propose the Class-Independent Relationship Network (CIRN). This method combines object detection information with the relative semantic similarity between each detected object and the navigation target, and constructs a new state representation based on similarity ranking. This state representation includes neither target features nor environment features, effectively decoupling the agent's navigation ability from target features. A Graph Convolutional Network (GCN) is then employed to learn the relationships between different objects based on their similarities. During testing, our approach demonstrates strong generalization capabilities, including on zero-shot navigation tasks with different targets and environments. Through extensive experiments in the AI2-THOR virtual environment, our method outperforms the current state-of-the-art approaches in the zero-shot object goal visual navigation task. Furthermore, we conducted experiments in more challenging cross-target and cross-scene settings, which further validate the robustness and generalization ability of our method. Our code is available at: https://github.com/SmartAndCleverRobot/ICRA-CIRN.
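
The similarity-ranked state described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` type, the `build_state` function, the row layout `[similarity, x, y, w, h, confidence]`, the use of cosine similarity over word embeddings, and the `max_objects` padding are all assumptions made for illustration. The point it shows is that class labels are used only to look up a similarity score and never enter the state itself.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # detector class name; only used to look up an embedding
    bbox: tuple        # assumed layout: (x_center, y_center, width, height) in [0, 1]
    confidence: float  # detector confidence score


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_state(detections, embeddings, target, max_objects=4):
    """Build a class-independent state: one row per detected object,
    sorted by semantic similarity to the target in descending order."""
    target_vec = embeddings[target]
    rows = []
    for det in detections:
        sim = cosine(embeddings[det.label], target_vec)
        # The row holds similarity and geometry only; the label is dropped here.
        rows.append([sim, *det.bbox, det.confidence])
    rows.sort(key=lambda row: row[0], reverse=True)
    rows = rows[:max_objects]
    # Zero-pad so the state has a fixed number of rows (an assumption of this sketch).
    rows += [[0.0] * 6 for _ in range(max_objects - len(rows))]
    return rows
```

Because the rows carry no class identity, swapping the target (for example from Toaster to Laptop) changes only the similarity values and the row order, not the structure of the state.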

Paper Structure (7 sections, 1 equation, 2 figures, 5 tables)

This paper contains 7 sections, 1 equation, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The motivation of this method is to decouple the agent's navigation ability from the navigation target, as shown in the figure. The object marked in red is the navigation target and the objects marked in yellow are other objects; the number on each bounding box is the semantic similarity between that object and the navigation target. The navigation target in the left figure is the Toaster, and the navigation target in the right figure is the Laptop. Although the test environment and test target differ from those used in training, the state constructed by CIRN does not contain features of the object or the environment, so the two cases differ only in spatial location and semantic similarity.
  • Figure 2: Architecture Overview: The model receives as input the object detection information for the different classes within the field of view, together with their semantic similarity to the target, sorted in descending order of semantic similarity. This input is processed by the GCN module before being fed into the LSTM network. The LSTM module is responsible for extracting and retaining past action information. Finally, an actor-critic network is used to generate actions.
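
The GCN stage in Figure 2 can be sketched with the standard graph-convolution propagation rule, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). This is a generic illustration rather than the CIRN module: how the paper defines or learns the adjacency between objects is not stated in this summary, so the fixed adjacency matrix, the single layer, and the function names below are assumptions. The LSTM and actor-critic stages are omitted.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric normalization D^(-1/2) (A + I) D^(-1/2).
    Assumes non-negative edge weights, so every degree is at least 1."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, features, weight):
    """One graph-convolution layer: mix each node's features with its
    neighbours', project with `weight`, then apply ReLU."""
    propagated = matmul(normalize_adjacency(adj), features)
    projected = matmul(propagated, weight)
    return [[max(0.0, value) for value in row] for row in projected]
```

In this sketch each row of `features` would be one object from the similarity-sorted input, and the output rows would be flattened and passed on to the recurrent network.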