Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models

Gabriel Sarch; Yue Wu; Michael J. Tarr; Katerina Fragkiadaki

TL;DR

HELPER addresses open-domain, natural-language instructions for embodied agents by augmenting frozen LLMs with a retrieval-based external memory of language–program pairs to generate executable visuomotor plans. The planner retrieves context-relevant examples ($K=3$) to produce Python programs over a fixed primitive API, while the executor builds semantic and occupancy maps, checks preconditions, and uses a Vision-Language Model for failure correction. Empirical results on the TEACh benchmark set a new state-of-the-art in both TfD and EDH, with substantial gains from memory-augmented prompting, failure diagnosis, and user feedback-driven personalization. The work advances practical, personalized human–robot collaboration in household settings by enabling robust open-ended instruction parsing and adaptive planning without task-specific finetuning, while outlining clear avenues for reducing perception and compute bottlenecks and extending multimodal memory.

Abstract

Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. Fixed prompts fall short, however, when the agent must parse open-domain natural language and adapt to a user's idiosyncratic procedures that are not known at prompt-engineering time. In this paper, we introduce HELPER, an embodied agent equipped with an external memory of language-program pairs that parses free-form human-robot dialogue into action programs through retrieval-augmented LLM prompting: relevant memories are retrieved based on the current dialogue, instruction, correction, or VLM description, and used as in-context prompt examples for LLM querying. The memory is expanded during deployment with pairs of the user's language and action plans, which assist future inferences and personalize them to the user's language and routines. HELPER sets a new state-of-the-art on the TEACh benchmark in both Execution from Dialog History (EDH) and Trajectory from Dialogue (TfD), with a 1.7x improvement over the previous state-of-the-art for TfD. Our models, code, and video results can be found on our project website: https://helper-agent-llm.github.io.
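
The retrieval-augmented prompting loop described above can be sketched in a few lines. This is a minimal illustration, not HELPER's implementation: the bag-of-words `embed` function is a toy stand-in for a learned text encoder, and the names `Memory`, `build_prompt`, and the primitive calls inside the example programs are hypothetical. Only the overall flow (retrieve the K=3 most similar language-program pairs, use them as in-context examples, add successful pairs back to memory) comes from the text.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Memory:
    """External memory of (language, program) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, language, program):
        # Memory expansion: store a pair so later queries can retrieve it.
        self.entries.append((language, program))

    def retrieve(self, query, k=3):
        # Rank stored pairs by similarity between the query and their language.
        q = embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(q, embed(e[0])), reverse=True
        )
        return ranked[:k]


def build_prompt(memory, query, k=3):
    """Assemble a few-shot prompt from the k retrieved examples plus the query."""
    parts = [
        f"Input: {lang}\nProgram:\n{prog}" for lang, prog in memory.retrieve(query, k)
    ]
    parts.append(f"Input: {query}\nProgram:")
    return "\n\n".join(parts)
```

In use, the prompt returned by `build_prompt` would be sent to a frozen LLM; if the resulting program executes successfully, calling `memory.add(query, program)` makes it available as an example for later, similar requests from the same user.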

Paper Structure (45 sections, 1 equation, 4 figures, 4 tables)

This paper contains 45 sections, 1 equation, 4 figures, 4 tables.

Introduction
Related Work
Instructable Embodied Agents
Prompting LLMs for action prediction and visual reasoning
Method
Planner: Retrieval-Augmented LLM Planning
Memory Expansion
Incorporating user feedback
Visually-Grounded Plan Correction using Vision-Language Models
Executor: Scene Perception, Pre-Condition Checks, Object Search and Action Execution
Scene and object state perception
Manipulation and navigation pre-condition checks
Locator: LLM-based common sense object search
Experiments
Evaluation on the TEACh dataset
...and 30 more sections

Figures (4)

Figure 1: Open-ended instructable agents with retrieval-augmented LLMs. We equip LLMs with an external memory of language and program pairs to retrieve in-context examples for prompts during LLM querying for task plans. Our model takes as input instructions, dialogue segments, corrections and VLM environment descriptions, retrieves relevant memories to use as in-context examples, and prompts LLMs to predict task plans and plan adjustments. Our agent executes the predicted plans from visual input using occupancy and semantic map building, 3D object detection and state tracking, and active exploration using guidance from LLMs' common sense to locate objects not present in the maps. Successful programs are added to the memory paired with their language context, allowing for personalized subsequent interactions.
Figure 2: HELPER's architecture. The model uses memory-augmented LLM prompting for task planning from instructions, corrections and human-robot dialogue, and for re-planning during failures given feedback from a VLM. The generated program is executed by the Executor module. The Executor builds semantic, occupancy and 3D object maps, tracks object states, verifies action preconditions, and queries LLMs for search locations for objects missing from the maps, using the Locator module.
Figure 3: HELPER parses dialogue segments, instructions, and corrections into visuomotor programs using retrieval-augmented LLM prompting. A. Illustration of the encoding and memory retrieval process. B. Prompt format and output of the Planner.
Figure 4: Inference of a failure feedback description by matching potential failure language descriptions with the current image using a vision-language model (VLM).
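
The matching step in the Figure 4 caption can be illustrated as picking the candidate description that scores highest against the current image. The sketch below assumes the image and each candidate description have already been embedded into a shared vector space and that cosine similarity is the score; both are assumptions for illustration, and `infer_failure_description` and the example descriptions are hypothetical names, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def infer_failure_description(image_vec, candidates):
    """Return the candidate description whose embedding best matches the image.

    candidates maps each description string to its (assumed precomputed)
    text embedding in the same space as image_vec.
    """
    return max(candidates, key=lambda d: cosine(image_vec, candidates[d]))
```

The selected description would then serve as the language input for re-planning, in the same way an instruction or a user correction does.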
