Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances

Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, Devendra Singh Chaplot

TL;DR

The paper tackles the lack of standardization in ImageNav by introducing InstanceImageNav, an instance-focused, embodiment-agnostic image-goal navigation task. It formalizes goal-image criteria, evaluation protocols, agent embodiment, and environment, and provides an HM3D-based benchmark with a diverse object set and a public leaderboard. A model-free PPO baseline demonstrates a large generalization gap, highlighting the need for more robust methods. This work lays the groundwork for consistent, real-world-applicable semantic embodied navigation and fuels community progress through a standardized benchmark.

Abstract

We consider the problem of embodied visual navigation given an image-goal (ImageNav), where an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image. Unlike related navigation tasks, ImageNav does not have a standardized task definition, which makes comparison across methods difficult. Further, existing formulations have two problematic properties: (1) image-goals are sampled from random locations, which can lead to ambiguity (e.g., looking at walls), and (2) image-goals match the camera specification and embodiment of the agent; this rigidity is limiting when considering user-driven downstream applications. We present the Instance-specific ImageNav task (InstanceImageNav) to address these limitations. Specifically, the goal image is 'focused' on some particular object instance in the scene and is taken with camera parameters independent of the agent. We instantiate InstanceImageNav in the Habitat Simulator using scenes from the Habitat-Matterport3D dataset (HM3D) and release a standardized benchmark to measure community progress.

Paper Structure

This paper contains 9 sections, 4 equations, and 6 figures.

Figures (6)

  • Figure 1: We present InstanceImageNav where an agent is tasked with navigating to the object depicted by a goal image. The goal camera is reflective of the task issuer, not the task executor.
  • Figure 2: Goal image generation: In this example, the target is an armchair. We sample candidate camera parameters radially about the object. For each candidate's RGBD+Semantic render, we compute object coverage (how much of the object is seen) and frame coverage (how much of the image is the object). We threshold these values to select goal images with clear and natural views of the target object.
  • Figure 3: Distributions of object categories and image goals in the InstanceImageNav-HM3D Train split. Left: distribution of objects at each stage of image filtering. Right: average number of image goals per object instance at each stage of image filtering.
  • Figure 4: Example goal images in InstanceImageNav-HM3D for each object category. Reflecting how the task may be issued in the real world, these images display a wide diversity of capture settings, including variable camera heights, field of view, and distance to the object.
  • Figure 5: Shortest path statistics for the InstanceImageNav-HM3D dataset. Episodes from the training split (top) and the validation split (bottom) are compared along the axes of Euclidean distance from start to goal (left), the geodesic distance along the shortest path (center), and the ratio of geodesic distance to Euclidean distance (right). Euclidean and geodesic distances are in meters.
  • ...and 1 more figures
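
The goal-image filtering described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the semantic render is a 2D array of per-pixel object IDs, and it approximates object coverage as the fraction of sampled object surface points that are visible from the candidate camera. The function names, the `point_visible` input, and the threshold values are all hypothetical; the summary does not state the thresholds used in the paper.

```python
import numpy as np


def frame_coverage(semantic: np.ndarray, target_id: int) -> float:
    """Fraction of image pixels whose semantic label is the target object."""
    return float(np.mean(semantic == target_id))


def object_coverage(point_visible: np.ndarray) -> float:
    """Fraction of sampled object surface points visible from the camera.

    `point_visible` is a hypothetical boolean array, one entry per sampled
    point on the object; how visibility is determined is left unspecified.
    """
    if point_visible.size == 0:
        return 0.0
    return float(np.mean(point_visible))


def keep_goal_image(
    semantic: np.ndarray,
    target_id: int,
    point_visible: np.ndarray,
    min_frame: float = 0.1,   # assumed threshold, not from the paper
    min_object: float = 0.5,  # assumed threshold, not from the paper
) -> bool:
    """Keep a candidate view only if both coverage values pass their thresholds."""
    return (
        frame_coverage(semantic, target_id) >= min_frame
        and object_coverage(point_visible) >= min_object
    )


# Toy candidate: a 4x4 semantic render where object 7 fills a 2x2 patch,
# and 3 of 4 sampled surface points are visible.
semantic = np.zeros((4, 4), dtype=np.int64)
semantic[1:3, 1:3] = 7
point_visible = np.array([True, True, True, False])

print(frame_coverage(semantic, 7))
print(object_coverage(point_visible))
print(keep_goal_image(semantic, 7, point_visible))
```

Thresholding both values together rejects views where the object is tiny in the frame as well as views where the object is heavily occluded or cropped, which matches the caption's stated aim of selecting clear and natural views of the target.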