Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation
Vishnu Sashank Dorbala, James F. Mullen, Dinesh Manocha
TL;DR
The paper tackles zero-shot object navigation with unconstrained natural language by introducing LGX, a framework that combines LLM-driven exploration, guided by scene-derived prompts, with GLIP-based open-vocabulary grounding to locate uniquely described objects in unseen environments. It demonstrates state-of-the-art results on RoboTHOR, with substantial gains in success rate (SR) and Success weighted by Path Length (SPL), and validates the approach in real-world robot experiments using a TurtleBot 2. The work analyzes LLM prompting strategies and grounding model choices in depth, providing insights into prompt design and grounding reliability. It also discusses practical limitations and avenues for future work to improve robustness in real-world, free-language navigation scenarios.
Abstract
We present LGX (Language-guided Exploration), a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON), where an embodied agent navigates to a uniquely described target object in a previously unseen environment. Our approach makes use of Large Language Models (LLMs) for this task by leveraging the LLM's commonsense reasoning capabilities for making sequential navigational decisions. Simultaneously, we perform generalized target object detection using a pre-trained Vision-Language grounding model. We achieve state-of-the-art zero-shot object navigation results on RoboTHOR, with a success rate (SR) improvement of more than 27% over the current baseline, OWL-ViT CLIP on Wheels (OWL CoW). Furthermore, we study the usage of LLMs for robot navigation and present an analysis of various prompting strategies affecting the model output. Finally, we showcase the benefits of our approach via real-world experiments that indicate the superior performance of LGX in detecting and navigating to visually unique objects.
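The results above are reported in terms of SR and SPL. As a rough illustration of how these two metrics are commonly computed in the object navigation literature, here is a minimal sketch. It is not the authors' evaluation code: the `Episode` record, its field names, and the handling of zero-length paths are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record (field names are assumptions)."""
    success: bool         # whether the episode was judged successful
    shortest_path: float  # shortest-path length from start to goal
    agent_path: float     # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        return 0.0
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length: mean of S_i * l_i / max(p_i, l_i)."""
    if not episodes:
        return 0.0
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        denom = max(e.agent_path, e.shortest_path)
        # Assumption: a successful zero-length episode counts as fully efficient.
        total += e.shortest_path / denom if denom > 0 else 1.0
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),
    Episode(success=False, shortest_path=4.0, agent_path=8.0),
    Episode(success=True, shortest_path=3.0, agent_path=3.0),
]
print(f"SR  = {success_rate(episodes):.3f}")
print(f"SPL = {spl(episodes):.3f}")
```

SR only counts whether the agent reached the target, while SPL additionally penalizes successful episodes in which the agent took a longer route than the shortest path, so a method can improve SR without improving SPL by the same margin.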
