RAG-Modulo: Solving Sequential Tasks using Experience, Critics, and Language Models

Abhinav Jain, Chris Jermaine, Vaibhav Unhelkar

TL;DR

The paper addresses the challenge of learning from past interactions in goal-driven robotic tasks with partial observability by augmenting LLM-based agents with an interaction memory and a bank of critics. RAG-Modulo retrieves relevant past interactions as in-context examples and incorporates feedback from syntax, semantics, and low-level executability critics to guide decision-making without gradient updates. The authors introduce a memory-based retrieval mechanism using cosine similarity to populate prompts with informative exemplars and demonstrate superior performance on AlfWorld and BabyAI benchmarks, achieving higher success rates and more efficient planning than strong baselines. This work highlights data-efficient learning for long-horizon robotic tasks and suggests paths toward real-world deployment and integration with continual learning frameworks.
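The retrieval step described above can be illustrated with a minimal sketch. It assumes each stored interaction already has an embedding from some text encoder; the `Interaction` fields, the function names, and the toy three-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    """One stored interaction. The fields are illustrative, not the paper's schema."""
    text: str               # e.g. task, observation, chosen action, critic feedback
    embedding: List[float]  # assumed to be precomputed by some text encoder


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(memory: List[Interaction], query: List[float], k: int) -> List[Interaction]:
    """Return the k stored interactions most similar to the query embedding."""
    ranked = sorted(
        memory,
        key=lambda item: cosine_similarity(item.embedding, query),
        reverse=True,
    )
    return ranked[:k]


# Toy memory with hand-written embeddings (assumption: real ones come from an encoder).
memory = [
    Interaction("drop key in occupied cell -> infeasible", [1.0, 0.0, 0.0]),
    Interaction("open locked door without key -> infeasible", [0.0, 1.0, 0.0]),
    Interaction("move to empty cell, then drop key -> feasible", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.0, 0.0]  # stands in for the embedding of the current situation
examples = retrieve_top_k(memory, query, k=2)
for example in examples:
    print(example.text)
```

The retrieved texts would then be placed in the prompt as in-context examples. A linear scan is enough for a sketch; a larger memory would likely call for a vector index.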

Abstract

Large language models (LLMs) have recently emerged as promising tools for solving challenging robotic tasks, even in the presence of action and observation uncertainties. Recent LLM-based decision-making methods (also referred to as LLM-based agents), when paired with appropriate critics, have demonstrated potential in solving complex, long-horizon tasks with relatively few interactions. However, most existing LLM-based agents lack the ability to retain and learn from past interactions, an essential trait of learning-based robotic systems. We propose RAG-Modulo, a framework that enhances LLM-based agents with a memory of past interactions and incorporates critics to evaluate the agents' decisions. The memory component allows the agent to automatically retrieve and incorporate relevant past experiences as in-context examples, providing context-aware feedback for more informed decision-making. Further, by updating its memory, the agent improves its performance over time, thereby exhibiting learning. Through experiments in the challenging BabyAI and AlfWorld domains, we demonstrate significant improvements in task success rates and efficiency, showing that the proposed RAG-Modulo framework outperforms state-of-the-art baselines.

Paper Structure

This paper contains 12 sections, 4 equations, 5 figures, 2 tables, and 2 algorithms.

Figures (5)

  • Figure 1: The RAG-Modulo framework incorporates a language model to generate candidate actions and a set of critics to evaluate them. Importantly, it features mechanisms for storing and retrieving past interactions, which enable learning from experience and improve decision-making over time.
  • Figure 2: (Left) The prompt in RAG-Modulo consists of an environment descriptor, a history of past interactions, and in-context examples to guide the LLM in selecting a feasible action. Here, the agent is carrying a blue key, which it must drop before picking up the green key. The retrieved in-context example shows a similar scenario in which the agent is unable to drop an object in an occupied cell. Based on this, the agent generates an action to move to an empty cell before completing the task. (Right) Illustration of how each critic provides feedback for the infeasible action shown on top.
  • Figure 3: (Left) The AlfWorld domain, in which the agent operates in a household environment. (Right) Execution trace while solving a task from BabyAI. Ticks and crosses mark feasible and infeasible actions, respectively.
  • Figure 4: Success Rate as a function of $K$
  • Figure 5: In-Executability and Episode Length as a function of $K$
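The prompt layout and critic feedback described in the captions for Figures 1 and 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the action format `<verb> <object>`, the verb vocabulary, the specific checks assigned to each critic, and all function names are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Optional

# A critic returns None when the action passes, or a feedback string otherwise.
Critic = Callable[[str], Optional[str]]

VALID_VERBS = {"pickup", "drop", "goto"}  # hypothetical action vocabulary


def syntax_critic(action: str) -> Optional[str]:
    """Check that the action has the assumed '<verb> <object>' shape."""
    if len(action.split()) != 2:
        return "syntax: expected '<verb> <object>'"
    return None


def make_semantics_critic(visible_objects: List[str]) -> Critic:
    """Check that the verb is known and the object is currently visible."""
    def critic(action: str) -> Optional[str]:
        parts = action.split()
        if len(parts) != 2:
            return None  # malformed actions are left to the syntax critic
        verb, obj = parts
        if verb not in VALID_VERBS:
            return f"semantics: unknown verb '{verb}'"
        if obj not in visible_objects:
            return f"semantics: '{obj}' is not visible"
        return None
    return critic


def make_low_level_critic(carrying: Optional[str]) -> Critic:
    """Stand-in executability check: the agent can hold only one object."""
    def critic(action: str) -> Optional[str]:
        if action.startswith("pickup ") and carrying is not None:
            return f"low-level: cannot pick up while carrying {carrying}"
        return None
    return critic


def collect_feedback(action: str, critics: List[Critic]) -> List[str]:
    """Run every critic and keep the non-empty feedback messages."""
    return [msg for msg in (critic(action) for critic in critics) if msg is not None]


def build_prompt(env: str, history: List[str], examples: List[str], feedback: List[str]) -> str:
    """Assemble environment descriptor, in-context examples, history, and feedback."""
    sections = [
        "Environment:\n" + env,
        "In-context examples:\n" + "\n".join(examples),
        "History:\n" + "\n".join(history),
    ]
    if feedback:
        sections.append("Critic feedback:\n" + "\n".join(feedback))
    return "\n\n".join(sections)


critics = [
    syntax_critic,
    make_semantics_critic(["blue_key", "green_key"]),
    make_low_level_critic(carrying="blue_key"),
]
feedback = collect_feedback("pickup green_key", critics)
prompt = build_prompt(
    env="Grid world; the agent can carry one object at a time.",
    history=["pickup blue_key"],
    examples=["drop key in occupied cell -> infeasible"],
    feedback=feedback,
)
print(prompt)
```

In this toy setup the proposed action is rejected only by the low-level check, and the feedback is appended to the next prompt so the language model can propose a different action, such as dropping the blue key first.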