Zero-shot Object-Centric Instruction Following: Integrating Foundation Models with Traditional Navigation

Sonia Raychaudhuri, Duy Ta, Katrina Ashton, Angel X. Chang, Jiuguang Wang, Bernadette Bucher

TL;DR

This work tackles zero-shot instruction following in unseen environments by grounding natural language directives in a language-derived factor-graph prior that guides SLAM-based navigation. The proposed LIFGIF framework first uses an LLM to generate a language-inferred graph from the instruction and then iteratively refines this graph online with observations using a factor-graph SLAM approach, implemented via GTSAM. It introduces the OC-VLN dataset for object-centric navigation and demonstrates superior zero-shot performance against state-of-the-art baselines in Habitat, with a real-world Boston Dynamics Spot demonstration. The approach effectively bridges linguistic guidance and spatial grounding, enabling robust object-centric instruction following without task-specific training.

Abstract

Large-scale scenes such as multifloor homes can be robustly and efficiently mapped with a 3D graph of landmarks estimated jointly with robot poses in a factor graph, a technique commonly used in commercial robots such as drones and robot vacuums. In this work, we propose Language-Inferred Factor Graph for Instruction Following (LIFGIF), a zero-shot method to ground natural language instructions in such a map. LIFGIF also includes a policy for following natural language navigation instructions in a novel environment while the map is constructed, enabling robust navigation performance in the physical world. To evaluate LIFGIF, we present a new dataset, Object-Centric VLN (OC-VLN), which tests the grounding of object-centric natural language navigation instructions. We compare against two state-of-the-art zero-shot baselines from related tasks, Object Goal Navigation and Vision Language Navigation, and show that LIFGIF outperforms them across all our evaluation metrics on OC-VLN. Finally, we demonstrate LIFGIF performing zero-shot object-centric instruction following in the real world on a Boston Dynamics Spot robot.
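The joint estimation of landmarks and robot poses mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation and does not use GTSAM: it assumes a one-dimensional world, linear measurements, and unit noise weights, and every name and number in it (`FACTORS`, `optimize`, `solve_linear`, the measurement values) is hypothetical. It only shows how a prior, an odometry factor, and two landmark observations are combined into one least-squares problem.

```python
# Toy 1D factor graph (an illustrative assumption, not the LIFGIF pipeline).
# Each factor is (coefficients, measurement) and encodes the residual
#   sum(coeff * variable) - measurement, with unit weight.

VARIABLES = ["x0", "x1", "l"]  # two robot poses and one landmark

FACTORS = [
    ({"x0": 1.0}, 0.0),              # prior: robot starts at the origin
    ({"x1": 1.0, "x0": -1.0}, 1.0),  # odometry: moved 1.0 forward
    ({"l": 1.0, "x0": -1.0}, 2.1),   # landmark seen 2.1 ahead of x0
    ({"l": 1.0, "x1": -1.0}, 0.9),   # landmark seen 0.9 ahead of x1
]


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def optimize(variables, factors):
    """Build the normal equations (A^T A) v = A^T b and solve them."""
    n = len(variables)
    index = {name: i for i, name in enumerate(variables)}
    hessian = [[0.0] * n for _ in range(n)]
    gradient = [0.0] * n
    for coeffs, measurement in factors:
        for vi, ci in coeffs.items():
            gradient[index[vi]] += ci * measurement
            for vj, cj in coeffs.items():
                hessian[index[vi]][index[vj]] += ci * cj
    return dict(zip(variables, solve_linear(hessian, gradient)))


estimate = optimize(VARIABLES, FACTORS)
```

The measurements are deliberately inconsistent (1.0 + 0.9 does not equal 2.1), so the solver spreads the 0.2 discrepancy across the odometry factor and the two landmark observations instead of trusting any one of them. A real system would work with 3D poses, nonlinear measurement models, and per-factor noise models.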

Paper Structure

This paper contains 11 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Using an LLM, we extract waypoints, landmarks, waypoint-waypoint actions and landmark-waypoint relations.
  • Figure 2: Language-inferred factor graph corresponding to the language instruction "Go forward to the piano. Turn right. Stop at the table".
  • Figure 3: LIFGIF in action. Visualizing the progress of our agent through an episode shows how the language-inferred graph gets optimized over time $t$ by making observations in the real world and performing data association, leading to a successful completion. (left) shows detected objects; (right) shows current robot pose, ground-truth trajectory (green), inferred waypoints (yellow) and landmarks (blue).
  • Figure 4: Comparison. Our agent's trajectory (right) aligns with the ground-truth trajectory better than that of Seq-VLFM (left), indicating better instruction-following ability. The agent trajectory is blue, the ground-truth path is green, and the target object is red.
  • Figure 5: Failure cases. Two frequent failure cases in LIFGIF are: (top) multiple instances of the same landmark are present, e.g. 'door'; and (bottom) an object is misclassified, e.g. 'television' identified as 'fireplace'.
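
The graph elements named in the captions for Figures 1 and 2 (waypoints, landmarks, waypoint-waypoint actions, landmark-waypoint relations) can be sketched as a small data structure. This is a minimal illustration, not the paper's representation: the class names, the relation label "near", and the hand-written decomposition of the Figure 2 instruction into three waypoints are all assumptions, and the graph an LLM actually produces may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """Waypoint-waypoint edge labeled with the instructed motion."""
    src: str
    dst: str
    motion: str


@dataclass(frozen=True)
class Relation:
    """Landmark-waypoint edge labeled with a spatial relation."""
    landmark: str
    waypoint: str
    relation: str


@dataclass
class LanguageGraph:
    """Hypothetical container for a language-inferred graph."""
    waypoints: list = field(default_factory=list)
    landmarks: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def landmarks_at(self, waypoint):
        """Return the landmarks tied to the given waypoint."""
        return [r.landmark for r in self.relations if r.waypoint == waypoint]

    def goal(self):
        """Return the final waypoint, where the agent should stop."""
        return self.waypoints[-1]


def example_graph():
    """Hand-written graph for: 'Go forward to the piano. Turn right.
    Stop at the table'. The decomposition is an assumption."""
    return LanguageGraph(
        waypoints=["w0", "w1", "w2"],
        landmarks=["piano", "table"],
        actions=[
            Action("w0", "w1", "go forward"),
            Action("w1", "w2", "turn right"),
        ],
        relations=[
            Relation("piano", "w1", "near"),
            Relation("table", "w2", "near"),
        ],
    )


graph = example_graph()
```

In this sketch, `graph.goal()` gives the stopping waypoint and `graph.landmarks_at("w1")` gives the landmark expected there. Positions for the waypoints and landmarks are deliberately absent: per the Figure 3 caption, those are what get optimized as observations arrive.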