Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation

Vishnu Sashank Dorbala; James F. Mullen; Dinesh Manocha

TL;DR

The paper tackles zero-shot object navigation with unconstrained natural language by introducing LGX, a framework that uses LLM-driven exploration guided by scene-derived prompts alongside GLIP-based open-vocabulary grounding to locate uniquely described objects in unseen environments. It demonstrates state-of-the-art results on RoboTHOR, with substantial SR/SPL gains, and validates the approach in real-world robot experiments using a TurtleBot 2. The work conducts extensive analyses of LLM prompting strategies and grounding model choices, providing insights into prompt design and grounding reliability. It also discusses practical limitations and avenues for future work to improve robustness in real-world, free-language navigation scenarios.

Abstract

We present LGX (Language-guided Exploration), a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON), where an embodied agent navigates to a uniquely described target object in a previously unseen environment. Our approach uses Large Language Models (LLMs) for this task, leveraging their commonsense reasoning capabilities to make sequential navigational decisions. Simultaneously, we perform generalized target object detection using a pre-trained Vision-Language grounding model. We achieve state-of-the-art zero-shot object navigation results on RoboTHOR, with a success rate (SR) improvement of over 27% over the current baseline, OWL-ViT CLIP on Wheels (OWL CoW). Furthermore, we study the use of LLMs for robot navigation and present an analysis of how various prompting strategies affect the model output. Finally, we showcase the benefits of our approach via real-world experiments that indicate the superior performance of LGX in detecting and navigating to visually unique objects.
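The success rate (SR) reported above, and the SPL metric named in the TL;DR, can be illustrated with a small sketch. This is not taken from the paper: it assumes SPL follows the commonly used "Success weighted by Path Length" definition, and the `Episode` tuple layout and function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical episode record: (success, shortest_path_length, agent_path_length)
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length, assuming the standard definition:
    mean over episodes of success * shortest / max(taken, shortest)."""
    total = 0.0
    for success, shortest, taken in episodes:
        denom = max(taken, shortest)
        if success and denom > 0:
            total += shortest / denom
    return total / len(episodes)


# Made-up episodes for illustration only (not results from the paper).
episodes = [(True, 5.0, 10.0), (False, 4.0, 8.0), (True, 3.0, 3.0)]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Under this definition a successful episode contributes at most 1, and less when the agent's path is longer than the shortest path, so SPL never exceeds SR.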

Paper Structure (25 sections, 2 equations, 8 figures, 5 tables)

Introduction
Related Work
Language-Guided Robotics
Language-Driven Zero-Shot Navigation
Language-Guided Scene Manipulation
LLMs for High-Level Task Planning
LLMs for Language-Guided Navigation
Solving L-ZSON using Language-Guided Exploration (LGX)
Method Overview
Scene Understanding
Intelligent Exploration with Large Language Models
Goal Detection and Motion Planning
Analyzing our Approach
Using GLIP for Zero-Shot Detection
Examining LLM Prompts for Exploration
...and 10 more sections

Figures (8)

Figure 1: LLM-Based Navigation: Our method, LGX, approaches the problem of Language-driven Zero-Shot Object Navigation, or L-ZSON. To navigate to and detect an unseen, arbitrarily described object class in an unknown environment, we first extract visual semantic information about the environment. This information is used to build a prompt for the Large Language Model (LLM), whose output provides either object sub-goals or Cartesian directions that guide the embodied agent towards the target. Meanwhile, GLIP searches the environment for the target object, which in this case is a "cat-shaped mug".
Figure 2: An overview of our approach. We first gather observational data from the environment by performing a 360-degree rotation to obtain depth and RGB images around the agent. The RGB images give us semantic information about the objects in the agent's view, while the depth images allow us to create a costmap. We then synthesize prompts for the LLM from the extracted object labels. Finally, the LLM drives the navigation scheme by selecting an output from the object list, which tells the agent which direction to head in. Simultaneously, we attempt to ground the target object in the scene with GLIP. When the target is found, we exit the decision-making loop and navigate directly to it.
Figure 3: An example of GLIP output when fed with the input string "Cat-shaped mug . Cat . Mug" on the image given. GLIP can successfully locate a unique object, like a "cat-shaped mug" and differentiate between it and related objects like a cat or a mug.
Figure 4: To validate the LLM's exploration capability, we define a two-phase process. The target object is in a different room, so the agent must first navigate out of the current room into a 'hallway'. In Phase 1, the LLM in LGX takes the objects in the current room, along with the hallway, as input. Not reaching the 'hallway' is a Phase 1 failure. In Phase 2, four possible rooms are visible and the agent must navigate to the room containing the goal object. We pass a set of common objects for each room, as listed in the real-world layouts table (tab:real-layouts), as input to the LLM in LGX. Not choosing the correct room is considered a Phase 2 failure.
Figure 5: The class breakdown of LGX versus the OWL CoW and original CoW on RoboTHOR. LGX provides a strong improvement in localizing the baseball bat, bowl, laptop, spray bottle, and vase classes. Similar performance is noted on larger classes like television and garbage can.
...and 3 more figures
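
The loop described in the Figure 2 caption (collect object labels from a scan, build an LLM prompt, follow the LLM's choice, stop once the grounding model finds the target) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt wording, the fallback rule, and the `scan`, `ask_llm`, and `detect` callables are all hypothetical stand-ins for the perception, LLM, and GLIP components. Only the " . "-separated phrase string mirrors the example in the Figure 3 caption.

```python
from typing import Callable, List, Optional


def build_prompt(target: str, labels: List[str]) -> str:
    """Compose an LLM prompt from detected object labels (wording is hypothetical)."""
    objects = ", ".join(sorted(set(labels)))
    return (
        f"I am looking for a {target}. "
        f"I can see the following objects: {objects}. "
        "Which object should I move towards? Answer with one object from the list."
    )


def grounding_caption(target: str, related: List[str]) -> str:
    """Join phrases with ' . ' as in the Figure 3 example input string."""
    return " . ".join([target] + related)


def explore(
    target: str,
    scan: Callable[[], List[str]],      # stand-in for the 360-degree scan + labelling
    ask_llm: Callable[[str], str],      # stand-in for the LLM call
    detect: Callable[[str], bool],      # stand-in for GLIP-style grounding
    max_steps: int = 10,
) -> Optional[List[str]]:
    """Return the sub-goals chosen before the target was detected,
    or None if the step budget ran out."""
    visited: List[str] = []
    for _ in range(max_steps):
        if detect(target):
            return visited              # target grounded: leave the decision loop
        labels = [label.lower() for label in scan()]
        if not labels:
            continue                    # nothing visible this step; scan again
        choice = ask_llm(build_prompt(target, labels)).strip().lower()
        # Assumed fallback: if the LLM names an unseen object, take the first label.
        visited.append(choice if choice in labels else labels[0])
    return None
```

In a real system the selected sub-goal would be passed to a motion planner operating on the costmap; here it is only recorded so that the control flow is easy to follow.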
