TANGO: Training-free Embodied AI Agents for Open-world Tasks

Filippo Ziliotto, Tommaso Campari, Luciano Serafini, Lamberto Ballan

TL;DR

TANGO addresses open-world embodied AI by eliminating task-specific training and instead using a large language model as a planner to compose a set of pre-trained action primitives. It combines a PointGoal navigation module, memory-augmented exploration, and vision-language perception to execute tasks across Open-set ObjectNav, Multi-Modal Lifelong Navigation, and Open Embodied Question Answering in zero-shot settings. The system is modular and neuro-symbolic: the LLM generates explainable pseudo-code that maps to primitives, which are executed by interpretable modules, enabling traceability and easy upgrading. Results show state-of-the-art or competitive performance without fine-tuning, underscoring the potential of LLM-guided modular planning for scalable, zero-shot embodied navigation.

Abstract

Large Language Models (LLMs) have demonstrated excellent capabilities in composing various modules together to create programs that can perform complex reasoning tasks on images. In this paper, we propose TANGO, an approach that extends the program composition via LLMs already observed for images, aiming to integrate those capabilities into embodied agents capable of observing and acting in the world. Specifically, by employing a simple PointGoal Navigation model combined with a memory-based exploration policy as a foundational primitive for guiding an agent through the world, we show how a single model can address diverse tasks without additional training. We task an LLM with composing the provided primitives to solve a specific task, using only a few in-context examples in the prompt. We evaluate our approach on three key Embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied Question Answering, achieving state-of-the-art results without any specific fine-tuning in challenging zero-shot scenarios.
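
The composition idea can be illustrated with a minimal sketch, assuming a program is simply an ordered list of primitive calls run by a small interpreter. The names navigate_to and explore_scene are taken from the module overview in Figure 3; everything else here (the state dictionary, the stub behavior, the fixed coordinates, and the run_program helper) is a hypothetical stand-in and not the authors' implementation, whose modules wrap a PointNav policy and pre-trained vision models.

```python
from typing import Callable, Dict, List, Tuple

State = dict


def explore_scene(state: State, target: str) -> State:
    """Stub exploration: pretend the target is found and store it in memory.

    Assumption: the real module drives a PointNav policy with a memory map;
    here a fixed coordinate is recorded purely for illustration.
    """
    state["memory"][target] = (1.0, 2.0)
    state["log"].append(f"explore_scene({target})")
    return state


def navigate_to(state: State, target: str) -> State:
    """Stub navigation: move the agent to a target already held in memory."""
    if target not in state["memory"]:
        raise KeyError(f"target {target!r} has not been located yet")
    state["position"] = state["memory"][target]
    state["log"].append(f"navigate_to({target})")
    return state


# Registry mapping primitive names (as an LLM might emit them) to callables.
PRIMITIVES: Dict[str, Callable[[State, str], State]] = {
    "explore_scene": explore_scene,
    "navigate_to": navigate_to,
}


def run_program(program: List[Tuple[str, str]]) -> State:
    """Execute each (primitive, argument) step in order on a fresh state."""
    state: State = {"position": (0.0, 0.0), "memory": {}, "log": []}
    for name, argument in program:
        state = PRIMITIVES[name](state, argument)
    return state


# A hypothetical program for the instruction "go to the sofa".
program = [("explore_scene", "sofa"), ("navigate_to", "sofa")]
final_state = run_program(program)
print(final_state["log"])
```

Keeping the primitives behind a name-to-function registry is one way to get the traceability and easy upgrading described in the TL;DR: the executed log mirrors the generated program step by step, and a module can be swapped without touching the interpreter.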

Paper Structure

This paper contains 22 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: We introduce TANGO, a modular neuro-symbolic system for compositional embodied visual navigation. Given a few examples of natural language instructions and the corresponding programs composed of action primitives, TANGO can generate executable programs, enabling the agent to perform multiple tasks within a 3D environment.
  • Figure 2: Overview of program generation in TANGO. Given a few in-context examples, the LLM provides a detailed sequence of steps to be executed by the agent in the given environment. The LLM is instructed to comment its output to allow for explainability.
  • Figure 3: Overview of TANGO modules. Modules span a variety of inputs and outputs. Orange modules use Python subroutines, while blue modules use pre-trained computer vision models (similarly to VisProg). The navigate_to and explore_scene modules, in green, both implement our foundational PointNav module; however, only explore_scene integrates the memory mechanism.
  • Figure 4: Examples from OpenEQA (Majumdar et al., 2024). The top section illustrates a successful episode in which TANGO understands the input query and correctly specifies the sequential targets. The lower section illustrates a failure caused by overly general directions from the LLM, which TANGO struggled to resolve.
  • Figure 5: Multi-Modal Lifelong Navigation Success Example. (top) RGB observation of the target during the STOP action (step $t_{i+\text{steps}}$). (middle) Value map for the specific target recomputed from the memory map (step $t_{i+\text{steps}}$). (bottom) Memory map after the target changes (step $t_{i}$).
  • ...and 1 more figure