GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Mukul Khanna; Ram Ramrakhya; Gunjan Chhablani; Sriram Yenamandra; Theophile Gervet; Matthew Chang; Zsolt Kira; Devendra Singh Chaplot; Dhruv Batra; Roozbeh Mottaghi

TL;DR

GOAT-Bench introduces the Go to Any Thing (GOAT) task to study universal, multi-modal lifelong navigation, in which targets are specified by category name, language description, or image in an open-vocabulary fashion. It provides a reproducible benchmark covering modular and end-to-end approaches, with and without memory, across the three goal modalities and a lifelong episode structure, using HM3DSem-derived scenes and open-vocabulary goals. The study finds that memory-enabled methods significantly improve navigation efficiency, while end-to-end RL can achieve strong success rates, often at the cost of efficiency; CLIP alone struggles with instance-specific goals, whereas CroCo-v2-based representations help image-goal navigation. Overall, GOAT-Bench highlights the importance of robust memory representations and modality-aware goal encoding for practical lifelong navigation and lays a foundation for future universal navigation systems.

Abstract

The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more effective user interaction with robots. To facilitate this goal, we propose GOAT-Bench, a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities, the role of explicit and implicit scene memories, their robustness to noise in goal specifications, and the impact of memory in lifelong scenarios.
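To make the task structure concrete, the following is a minimal sketch of how a GOAT episode could be represented as a sequence of subtasks, each carrying one of the three goal modalities. The class and field names (`GoalModality`, `Subtask`, `Episode`, `scene_id`) and the image path are illustrative assumptions, not the benchmark's actual data format; the example goals are taken from the Figure 1 caption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class GoalModality(Enum):
    """The three ways a GOAT goal can be specified."""
    CATEGORY = "category"
    LANGUAGE = "language"
    IMAGE = "image"


@dataclass
class Subtask:
    modality: GoalModality
    # Hypothetical encoding: a category name, a free-form description,
    # or a path to a goal image, depending on the modality.
    goal: str


@dataclass
class Episode:
    scene_id: str            # hypothetical scene identifier
    subtasks: List[Subtask]  # goals are visited in order within one scene


def modality_counts(episode: Episode) -> Dict[GoalModality, int]:
    """Count how many subtasks of each modality an episode contains."""
    counts = {m: 0 for m in GoalModality}
    for subtask in episode.subtasks:
        counts[subtask.modality] += 1
    return counts


example = Episode(
    scene_id="scene_0",
    subtasks=[
        Subtask(GoalModality.CATEGORY, "recliner chair"),
        Subtask(GoalModality.IMAGE, "goal_images/oven.png"),
        Subtask(
            GoalModality.LANGUAGE,
            "the white book on the coffee table in the living room",
        ),
    ],
)
```

Keeping the subtasks in one ordered list reflects the lifelong aspect of the task: an agent that retains memory across subtasks can reuse what it observed while pursuing earlier goals.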

Paper Structure (22 sections, 11 figures, 5 tables)

Introduction
Related work
Task
Dataset
Baselines
Modular Baseline
SenseAct-NN Baselines
Results
Modular vs. SenseAct-NN approaches
Analysis
How do agents perform on each modality?
How important is memory for efficient navigation?
Does success and efficiency improve over time?
How robust are these methods to noise in goal specifications?
Conclusion
...and 7 more sections

Figures (11)

Figure 1: We study the Go to Any Thing (GOAT) task, which involves agents navigating to a sequence of open vocabulary goals specified through any of the three modalities: category name, a language description, or an image. We propose GOAT-Bench, a benchmark for the GOAT task, where we evaluate modular and monolithic, explicit and implicit map-based navigation approaches. In the above example, we task the agent with sequentially navigating to 1) a recliner chair (from a closed set of k categories), 2) the oven shown in the picture, 3) "the white book on the coffee table in the living room", and some other objects in the scene. The goal of the benchmark is to facilitate progress towards building such universal, multi-modal, lifelong agents.
Figure 2: Preview of the GOAT-Bench dataset. We show multi-modal examples of goal instances from the dataset: images of objects (blue), language descriptions (orange) and object category annotations (green).
Figure 3: LanguageNav dataset generation pipeline. We automatically generate language descriptions for object goals by leveraging VLMs, LLMs, and ground-truth information from the simulator. We first capture an image of the goal object from a valid viewpoint. Next, we retrieve spatial and semantic information about the nearby objects from the simulator. We then prompt BLIP-2 [li2023blip2] to extract appearance attributes of the object. These are then combined to prompt ChatGPT-3.5 to output a language description of the goal.
Figure 4: Performance across modalities. We break down the performance of all three baselines by the goal modality of the subtask: object category, language, or image.
Figure 5: Usefulness of memory. We benchmark the drop in performance when no memory is maintained across subtasks, for the modular GOAT [chang2023goat] and SenseAct-NN Monolithic RL baselines.
...and 6 more figures
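
The description-generation pipeline in the Figure 3 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the prompt wording, and the `(name, relation)` format for nearby objects are hypothetical, and the VLM and LLM are passed in as plain callables rather than real BLIP-2 or ChatGPT clients.

```python
from typing import Callable, List, Tuple


def build_prompt(category: str, attributes: str,
                 nearby: List[Tuple[str, str]]) -> str:
    """Combine appearance attributes and simulator context into one LLM prompt.

    `nearby` holds hypothetical (object_name, spatial_relation) pairs
    retrieved from the simulator's ground-truth annotations.
    """
    relations = "; ".join(f"{rel} the {name}" for name, rel in nearby)
    return (
        f"Goal object: {category}. "
        f"Appearance: {attributes}. "
        f"Spatial context: {relations}. "
        "Write one sentence that describes where to find this object."
    )


def generate_description(category: str,
                         goal_image: object,
                         nearby: List[Tuple[str, str]],
                         vlm: Callable[[object], str],
                         llm: Callable[[str], str]) -> str:
    """Sketch of the pipeline: image -> attributes -> prompt -> description."""
    # Step 1: a VLM (BLIP-2 in the paper) extracts appearance attributes
    # from an image of the goal captured at a valid viewpoint.
    attributes = vlm(goal_image)
    # Step 2: attributes are merged with spatial and semantic information.
    prompt = build_prompt(category, attributes, nearby)
    # Step 3: an LLM (ChatGPT-3.5 in the paper) writes the description.
    return llm(prompt)
```

Passing the models in as callables keeps the sketch runnable without network access; for example, `generate_description("book", None, [("coffee table", "on")], vlm=lambda img: "white", llm=lambda p: p)` returns the assembled prompt unchanged.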
