Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models
Gabriel Sarch, Yue Wu, Michael J. Tarr, Katerina Fragkiadaki
TL;DR
HELPER addresses open-domain, natural-language instructions for embodied agents by augmenting frozen LLMs with a retrieval-based external memory of language–program pairs to generate executable visuomotor plans. The planner retrieves the most context-relevant examples ($K=3$) and uses them as in-context prompts to produce Python programs over a fixed primitive API. The executor builds semantic and occupancy maps, checks preconditions, and uses a Vision-Language Model (VLM) for failure correction. On the TEACh benchmark, HELPER sets a new state-of-the-art in both Trajectory from Dialogue (TfD) and Execution from Dialog History (EDH), with substantial gains from memory-augmented prompting, failure diagnosis, and personalization driven by user feedback. The work advances practical, personalized human–robot collaboration in household settings by enabling robust open-ended instruction parsing and adaptive planning without task-specific finetuning. It also outlines avenues for reducing perception and compute bottlenecks and for extending the memory to multiple modalities.
Abstract
Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. Fixed prompts fall short, however, when the agent must parse open-domain natural language and adapt to a user's idiosyncratic procedures that are not known at prompt-engineering time. In this paper, we introduce HELPER, an embodied agent equipped with an external memory of language-program pairs that parses free-form human-robot dialogue into action programs through retrieval-augmented LLM prompting: relevant memories are retrieved based on the current dialogue, instruction, correction, or VLM description, and used as in-context prompt examples for LLM querying. The memory is expanded during deployment to include pairs of the user's language and action plans, which assist future inferences and personalize them to the user's language and routines. HELPER sets a new state-of-the-art on the TEACh benchmark in both Execution from Dialog History (EDH) and Trajectory from Dialogue (TfD), with a 1.7x improvement over the previous state-of-the-art for TfD. Our models, code, and video results can be found on our project website: https://helper-agent-llm.github.io.
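To make the retrieval-augmented prompting idea concrete, here is a minimal sketch of a memory of language-program pairs that returns the top-K entries for a query and assembles them into a few-shot prompt. It is an illustration under stated assumptions, not the HELPER implementation: the bag-of-words `embed` function stands in for whatever learned text encoder is actually used, and the names `ProgramMemory`, `build_prompt`, and the primitives inside the example programs (`find`, `pickup`, `place`, `toggle_on`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a stand-in for a learned text encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ProgramMemory:
    """External memory of (language, program) pairs; hypothetical name."""

    def __init__(self):
        self.entries = []  # each entry: (language, program, embedding)

    def add(self, language, program):
        # Called at setup time, and again during deployment to store
        # the user's own language paired with the plan that was executed.
        self.entries.append((language, program, embed(language)))

    def retrieve(self, query, k=3):
        """Return the k pairs whose language is most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(language, program) for language, program, _ in ranked[:k]]


def build_prompt(memory, query, k=3):
    """Use the retrieved pairs as in-context examples ahead of the new input."""
    parts = [
        f"Input: {language}\nProgram:\n{program}\n"
        for language, program in memory.retrieve(query, k)
    ]
    parts.append(f"Input: {query}\nProgram:\n")
    return "\n".join(parts)


# Example programs use hypothetical primitives, for illustration only.
memory = ProgramMemory()
memory.add("make a cup of coffee",
           "mug = find('Mug')\npickup(mug)\nplace(mug, 'CoffeeMachine')\ntoggle_on('CoffeeMachine')")
memory.add("water the plant",
           "cup = find('Cup')\npickup(cup)\nplace(cup, 'Sink')")
memory.add("put the apple on the table",
           "apple = find('Apple')\npickup(apple)\nplace(apple, 'DiningTable')")
memory.add("turn on the lamp",
           "toggle_on('FloorLamp')")

prompt = build_prompt(memory, "please make me some coffee", k=3)
print(prompt)
```

The resulting prompt would then be sent to a frozen LLM, which is omitted here. Calling `memory.add(...)` with a user's phrasing and the plan that succeeded is one simple way to model the deployment-time memory expansion described above, so that later queries in that user's style retrieve the personalized example.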
