HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models

Gabriel Sarch; Sahil Somani; Raghav Kapoor; Michael J. Tarr; Katerina Fragkiadaki

TL;DR

HELPER-X extends HELPER with memory-augmented prompting to handle four interactive vision-language domains (TEACh, ALFRED, DialFRED, Tidy Task) using two memory strategies: domain-specific prompt templates (HELPER-X_P) and a shared cross-domain memory (HELPER-X_S). It also introduces a question-asking API for gathering information during planning. Across the four benchmarks, HELPER-X achieves state-of-the-art few-shot performance without in-domain training, matching or improving on domain-specialized baselines. The approach shows that memory-augmented LLMs can serve as versatile planners for embodied agents across varied instruction styles, with practical implications for scalable multi-domain agents in vision-language tasks. Limitations include reliance on costly models (GPT-4) and the manual template curation currently required to extend to new domains, which future work would need to automate.

Abstract

Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners. This technique retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt, improving the LLM's ability to infer the correct actions and task plans. In this technical report, we extend the capabilities of HELPER by expanding its memory with a wider array of examples and prompts and by integrating additional APIs for asking questions. This simple expansion of HELPER into a shared memory enables the agent to work across four domains: executing plans from dialogue, following natural language instructions, actively asking questions, and commonsense room reorganization. We evaluate the agent on four diverse interactive vision-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks using a single agent, without requiring in-domain training, and remains competitive with agents that have undergone in-domain training.
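The retrieval step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the memory contents, the function names (`retrieve`, `build_prompt`), and the action names inside the example programs are hypothetical, and a simple bag-of-words cosine similarity stands in for whatever embedding model the agent actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical shared memory of (instruction, program) pairs drawn from
# several domains. The programs are illustrative strings, not a real API.
MEMORY = [
    ("make a cup of coffee",
     "mug = find('Mug'); place(mug, 'CoffeeMachine'); toggle_on('CoffeeMachine')"),
    ("put a clean plate on the table",
     "plate = find('Plate'); clean(plate); place(plate, 'DiningTable')"),
    ("tidy the pillow that is on the floor",
     "pillow = find('Pillow'); place(pillow, 'Sofa')"),
]


def bag_of_words(text):
    """Lowercase word counts; a stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(instruction, memory, k=2):
    """Return the k memory entries most similar to the input instruction."""
    query = bag_of_words(instruction)
    ranked = sorted(
        memory,
        key=lambda ex: cosine(query, bag_of_words(ex[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction, memory, k=2):
    """Place retrieved examples before the new instruction as in-context shots."""
    shots = "\n\n".join(
        f"Instruction: {instr}\nProgram: {prog}"
        for instr, prog in retrieve(instruction, memory, k)
    )
    return f"{shots}\n\nInstruction: {instruction}\nProgram:"


if __name__ == "__main__":
    print(build_prompt("please make me some coffee", MEMORY, k=1))
```

Because the memory is a single list, instructions from any domain go through the same retrieval call. A per-domain variant would differ only in which prompt template wraps the retrieved examples.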

Paper Structure (39 sections, 1 equation, 2 figures, 5 tables)

Introduction
HELPER-X
Background
Unified Memory-Augmented Prompting
Prompt Retrieval
Shared Example Memory
Question Asking API
Experiments
Inferring and Executing Action Plans from Dialogue
Dataset
Following Natural Language Instructions
Dataset
Instruction Following with Asking Questions
Dataset
Tidying Up using Spatial Commonsense Reasoning
...and 24 more sections

Figures (2)

Figure 1: TEACh-tailored HELPER (Sarch et al., 2023) demonstrates a 6.9% drop in success when applied to ALFRED, despite sharing the same action space and environments, due to variations in language inputs and tasks. HELPER-X consistently performs well in both domains with one model.
Figure 2: Illustration of the shared example memory (HELPER-X$_{S}$; top) and the prompt retrieval (HELPER-X$_{P}$; bottom). The memory is shared across domains in both versions, allowing language and task inputs from any of the domains.
