Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation

Xiaohan Lei, Min Wang, Wengang Zhou, Li Li, Houqiang Li

TL;DR

This work designs a new modular navigation framework, Instance-aware Exploration-Verification-Exploitation (IEVE), for instance-level image goal navigation. The framework lets the agent actively switch among exploration, verification, and exploitation actions, helping it make reasonable decisions in different situations.

Abstract

As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of specific instances during navigation. In this work, we propose to imitate the human behaviour of "getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration, verification, and exploitation actions, thereby helping the agent make reasonable decisions under different situations. On the challenging Habitat-Matterport 3D semantic (HM3D-SEM) dataset, our method surpasses previous state-of-the-art work, with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success).

Paper Structure

This paper contains 14 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Instance ImageGoal Navigation tasks an agent with navigating to a particular object instance described by the goal image. The agent is initially in an unexplored environment at $t_1$. While exploring the environment, it encounters a "bed" similar to the target at $t_2$ and discerns their differences. Eventually, at $t_3$, the agent finds the "bed" described in the goal image.
  • Figure 2: Framework Overview. Our model consists of five main components. Instance Classification $f_{class}$ predicts the object's class in goal image $n$. Online Mapping $f_{SLAM}$ uses RGB-D observations and sensor pose readings $\mathcal{P}(t)$ to construct a semantic map $\mathcal{M}(t)$ of the environment. The Switch Policy $\pi _S$ and Goal Mapping Policy $\pi _{GM}$ are interconnected: the Switch Policy $\pi _S$ determines which goal map $\mathcal{M}_g(t)$ the Goal Mapping Policy outputs, based on two judgments (whether a potential target exists and whether the potential target is confirmed). Once the goal map $\mathcal{M}_g(t)$ is determined, the Local Policy $\pi _l$ determines the action $a(t)$ taken by the agent at the current timestep.
  • Figure 3: Switch Policy and Goal Mapping Policy. The Goal Map Selection module $f_{switch}$ receives two inputs: the shortest Euclidean distance between the agent and the potential target (or an indication that no potential target exists), and the number of matched keypoints. The Goal Map Selection function $f_{switch}$ maps these inputs to a selection signal. At each timestep, the Switch Policy $\pi _S$ chooses exactly one of the three parallel modules of the Goal Mapping Policy $\pi _{GM}$.
  • Figure 4: Goal Map Selection function $f_{switch}$ with respect to the Euclidean distance from the agent to the potential target and the number of matched keypoints. Best viewed in color.
  • Figure 5: Qualitative example of our IEVE agent performing the Instance ImageGoal Navigation task in the Habitat simulator. The agent is initialized at $T=1$ and finds a potential target at $T=5$. After carefully evaluating the potential target from $T=5$ to $T=12$, the agent proceeds with its exploration. Finally, the agent identifies the goal bed instance at $T=95$.
  • ...and 2 more figures
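
The captions for Figures 3 and 4 describe a selection function that maps two inputs (distance to the potential target, number of matched keypoints) to exactly one of three modes. The Python sketch below illustrates that idea only. The threshold values, the names `Mode`, `CONFIRM_KEYPOINTS`, `REJECT_KEYPOINTS`, and `CLOSE_DISTANCE_M`, and the exact decision rule are assumptions made for illustration; the page above does not state the authors' actual rule or values.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    EXPLORATION = "exploration"
    VERIFICATION = "verification"
    EXPLOITATION = "exploitation"


# Hypothetical thresholds, chosen only for illustration (not from the paper).
CONFIRM_KEYPOINTS = 60   # enough matches to treat the target as confirmed
REJECT_KEYPOINTS = 10    # too few matches, once close, to keep the candidate
CLOSE_DISTANCE_M = 1.0   # "close enough" distance in metres


def f_switch(distance_m: Optional[float], matched_keypoints: int) -> Mode:
    """Pick exactly one mode from the two inputs named in the caption.

    distance_m is None when no potential target exists.
    """
    if distance_m is None:
        # No candidate in the map: keep exploring.
        return Mode.EXPLORATION
    if matched_keypoints >= CONFIRM_KEYPOINTS:
        # Candidate confirmed: navigate to it.
        return Mode.EXPLOITATION
    if distance_m <= CLOSE_DISTANCE_M and matched_keypoints < REJECT_KEYPOINTS:
        # Already close and still few matches: assume a distractor.
        return Mode.EXPLORATION
    # Otherwise approach the candidate to look more closely.
    return Mode.VERIFICATION


if __name__ == "__main__":
    for d, k in [(None, 0), (5.0, 80), (5.0, 20), (0.5, 3)]:
        print(d, k, f_switch(d, k).value)
```

Under these assumed thresholds, a distant candidate with an intermediate number of matches triggers verification, which mirrors the "getting closer to confirm" behaviour described in the abstract.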