Zero-shot Object-Centric Instruction Following: Integrating Foundation Models with Traditional Navigation
Sonia Raychaudhuri, Duy Ta, Katrina Ashton, Angel X. Chang, Jiuguang Wang, Bernadette Bucher
TL;DR
This work tackles zero-shot instruction following in unseen environments by grounding natural language instructions in a language-derived factor-graph prior that guides SLAM-based navigation. The proposed LIFGIF framework first uses an LLM to generate a language-inferred graph from the instruction, then refines this graph online as observations arrive, using factor-graph SLAM implemented with GTSAM. The work also introduces the OC-VLN dataset for object-centric navigation and reports that LIFGIF outperforms state-of-the-art zero-shot baselines in Habitat, alongside a real-world demonstration on a Boston Dynamics Spot robot. The approach links linguistic guidance to spatial grounding, enabling object-centric instruction following without task-specific training.
Abstract
Large-scale scenes such as multi-floor homes can be robustly and efficiently mapped with a 3D graph of landmarks estimated jointly with robot poses in a factor graph, a technique commonly used in commercial robots such as drones and robot vacuums. In this work, we propose Language-Inferred Factor Graph for Instruction Following (LIFGIF), a zero-shot method to ground natural language instructions in such a map. LIFGIF also includes a policy for following natural language navigation instructions in a novel environment while the map is constructed, enabling robust navigation performance in the physical world. To evaluate LIFGIF, we present a new dataset, Object-Centric VLN (OC-VLN), which tests the grounding of object-centric natural language navigation instructions. We compare against two state-of-the-art zero-shot baselines from related tasks, Object Goal Navigation and Vision Language Navigation, and show that LIFGIF outperforms them across all our evaluation metrics on OC-VLN. Finally, we demonstrate LIFGIF performing zero-shot object-centric instruction following in the real world on a Boston Dynamics Spot robot.
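To make the phrase "landmarks estimated jointly with robot poses in a factor graph" concrete, here is a minimal sketch in plain Python. It is not LIFGIF and does not use GTSAM: it assumes a one-dimensional world, linear measurements, and hand-picked noise values, and every name in it (`optimize`, `solve_linear_system`, the factor tuples) is hypothetical and chosen only for illustration. Each factor constrains either one variable (a prior) or the difference between two variables (odometry or a landmark observation), and all variables are solved for together by weighted least squares.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                scale = aug[r][col] / aug[col][col]
                aug[r] = [a - scale * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def optimize(factors, num_vars):
    """Weighted least squares over factors (i, j, z, sigma).

    Each factor models x[j] - x[i] = z with noise sigma;
    i is None for a prior, which models x[j] = z.
    """
    H = [[0.0] * num_vars for _ in range(num_vars)]  # information matrix
    g = [0.0] * num_vars                             # information vector
    for i, j, z, sigma in factors:
        weight = 1.0 / sigma ** 2
        # Non-zero Jacobian entries of this factor: +1 at j, -1 at i.
        jac = [(j, 1.0)] + ([(i, -1.0)] if i is not None else [])
        for a, ja in jac:
            g[a] += weight * ja * z
            for b, jb in jac:
                H[a][b] += weight * ja * jb
    return solve_linear_system(H, g)


# Variables 0..2 are robot poses, variable 3 is one landmark (toy values).
factors = [
    (None, 0, 0.0, 0.01),  # prior anchoring the first pose at the origin
    (0, 1, 1.0, 0.1),      # odometry between pose 0 and pose 1
    (1, 2, 1.1, 0.1),      # odometry between pose 1 and pose 2
    (0, 3, 2.6, 0.2),      # landmark observed from pose 0
    (1, 3, 1.4, 0.2),      # landmark observed from pose 1
    (2, 3, 0.5, 0.2),      # landmark observed from pose 2
]
estimate = optimize(factors, 4)
print("poses:", estimate[:3], "landmark:", estimate[3])
```

Because the landmark appears in three factors, its estimate is pulled toward a value consistent with all three observations, and the poses shift slightly in return. A practical system would use 3D poses, nonlinear measurement models, and an incremental solver.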
