InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, Hao Dong

TL;DR

This work tackles zero-shot generic instruction navigation in unexplored environments by introducing InstructNav, a training-free framework that unifies diverse instruction types through Dynamic Chain-of-Navigation (DCoN). DCoN is complemented by Multi-sourced Value Maps (Action, Semantic, Trajectory, and Intuition, the last produced by a multimodal large model), which convert linguistic plans into robot actions and enable dynamic planning and exploration without pre-built maps. The approach achieves state-of-the-art zero-shot performance on R2R-CE, Habitat ObjNav, and DDN, and is validated with real-robot experiments across varied indoor scenes. The results suggest that grounding language-driven navigation in a modular planning architecture built on large models can support robust, versatile human-robot interaction in unknown environments.

Abstract

Enabling robots to navigate by following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. However, this goal is challenging because different navigation tasks require different strategies, and the scarcity of instruction navigation data hinders training a single model that covers them. As a result, previous methods are each constrained to one specific type of navigation instruction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce Dynamic Chain-of-Navigation (DCoN) to unify the planning process for different types of navigation instructions. Furthermore, we propose Multi-sourced Value Maps to model key elements in instruction navigation so that linguistic DCoN planning can be converted into robot-actionable trajectories. With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-training methods. InstructNav also surpasses the previous SOTA method by 10.48% on zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation (DDN). Real-robot experiments in diverse indoor scenes further demonstrate our method's robustness to variations in environment and instruction.

Paper Structure

This paper contains 29 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: InstructNav can follow different types of navigation instructions in diverse indoor scenes.
  • Figure 2: The workflow of Dynamic Chain-of-Navigation (DCoN). An LLM unifies different types of navigation instructions into DCoN. The next action and landmarks are updated at every decision step based on the observed scene objects. Beyond extracting actions and landmarks, DCoN achieves semantic label alignment, common-sense reasoning, and environmental exploration for navigation planning.
  • Figure 3: The system framework of InstructNav. The next Action i and Landmarks i are obtained from DCoN. A scene semantic point cloud is created from the RGB-D observation and 2D semantic segmentation. With this information, the Multi-sourced Value Maps $m_{a}$, $m_{s}$, $m_{t}$, and $m_{i}$ can be established. Redder areas represent higher $\uparrow$ values, while bluer areas indicate lower $\downarrow$ values. By synthesizing them into a decision-making value map $m$, InstructNav can plan the next waypoint.
  • Figure 4: Effect of $N$ RGB observations.
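
The Figure 3 caption describes synthesizing $m_{a}$, $m_{s}$, $m_{t}$, and $m_{i}$ into a decision-making map $m$ and planning the next waypoint from it. The captions do not give the fusion rule, so the following is only a minimal sketch under stated assumptions: each map is min-max normalized, the maps are summed with equal weight, and the waypoint is the highest-valued navigable cell. The function names, the grid representation, and the toy values are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]


def normalize(grid: Grid) -> Grid:
    """Min-max rescale a value map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def fuse_value_maps(maps: List[Grid]) -> Grid:
    """ASSUMED fusion rule: unweighted sum of the normalized maps."""
    normed = [normalize(m) for m in maps]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in normed) for c in range(cols)]
            for r in range(rows)]


def pick_waypoint(m: Grid,
                  navigable: List[List[bool]]) -> Optional[Tuple[int, int]]:
    """Return (row, col) of the highest-valued navigable cell, or None."""
    best, best_val = None, float("-inf")
    for r, row in enumerate(m):
        for c, v in enumerate(row):
            if navigable[r][c] and v > best_val:
                best, best_val = (r, c), v
    return best


# Toy 2x3 grids standing in for the action, semantic, trajectory,
# and intuition value maps (illustrative numbers only).
m_a = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
m_s = [[0.0, 0.0, 4.0], [0.0, 2.0, 0.0]]
m_t = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # constant map, contributes nothing
m_i = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
navigable = [[True, True, False], [True, True, True]]

m = fuse_value_maps([m_a, m_s, m_t, m_i])
waypoint = pick_waypoint(m, navigable)
print(waypoint)  # (1, 1)
```

In this toy case the cell at row 0, column 2 ties for the highest fused value but is marked non-navigable, so the sketch selects (1, 1) instead. A real system would likely weight the maps differently and plan a path to the waypoint; both are outside what the captions specify.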