Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances
Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, Devendra Singh Chaplot
TL;DR
The paper tackles the lack of standardization in ImageNav by introducing InstanceImageNav, an instance-focused, embodiment-agnostic image-goal navigation task. It formalizes the goal-image criteria, evaluation protocol, agent embodiment, and environment, and provides an HM3D-based benchmark with a diverse object set and a public leaderboard. A model-free PPO baseline exhibits a large generalization gap, highlighting the need for more robust methods. The standardized benchmark lays the groundwork for consistent, real-world-applicable research on semantic embodied navigation.
Abstract
We consider the problem of embodied visual navigation given an image-goal (ImageNav), where an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image. Unlike related navigation tasks, ImageNav does not have a standardized task definition, which makes comparison across methods difficult. Further, existing formulations have two problematic properties: (1) image-goals are sampled from random locations, which can lead to ambiguity (e.g., looking at walls), and (2) image-goals match the camera specification and embodiment of the agent; this rigidity is limiting when considering user-driven downstream applications. We present the Instance-specific ImageNav task (InstanceImageNav) to address these limitations. Specifically, the goal image is 'focused' on some particular object instance in the scene and is taken with camera parameters independent of the agent. We instantiate InstanceImageNav in the Habitat Simulator using scenes from the Habitat-Matterport3D dataset (HM3D) and release a standardized benchmark to measure community progress.
