
P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

Weiye Xu, Min Wang, Wengang Zhou, Houqiang Li

TL;DR

This paper tackles planning for embodied everyday tasks by introducing Progressive Retrieval Augmented Generation (P-RAG), which progressively enriches a dynamic database with task-specific experiences without ground-truth data. By retrieving both task-name and scene-graph-based experiences across iterative interactions, P-RAG guides a large language model to generate more informed action sequences. The approach combines LLM planning with regex-based error checking, action filtering, and decomposition, while leveraging a MiniLM-embedded, cosine-similarity-based retrieval of top-K historical trajectories. Experiments on MINI-BEHAVIOR and ALFRED demonstrate that P-RAG achieves competitive, ground-truth-free performance, with self-iteration further boosting results and showing rapid saturation, indicating effective knowledge accumulation. The work suggests broad applicability of ground-truth-free, progressive retrieval for planning in interactive environments.

Abstract

Embodied Everyday Task is a popular task in the embodied AI community, requiring agents to make a sequence of actions based on natural language instructions and visual observations. Traditional learning-based approaches face two challenges. First, natural language instructions often lack explicit task planning. Second, extensive training is required to equip models with knowledge of the task environment. Previous works based on Large Language Models (LLMs) either suffer from poor performance due to the lack of task-specific knowledge or rely on ground truth as few-shot samples. To address these limitations, we propose a novel approach called Progressive Retrieval Augmented Generation (P-RAG), which not only leverages the powerful language processing capabilities of LLMs but also progressively accumulates task-specific knowledge without ground truth. Compared to conventional RAG methods, which retrieve relevant information from the database in a one-shot manner to assist generation, P-RAG introduces an iterative approach to progressively update the database. In each iteration, P-RAG queries the latest database and obtains historical information from the previous interaction as experiential references for the current interaction. Moreover, we introduce a more granular retrieval scheme that retrieves not only similar tasks but also similar situations, providing more valuable reference experiences. Extensive experiments reveal that P-RAG achieves competitive results without utilizing ground truth and can further improve performance through self-iterations.
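
The progressive loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `run_progressive`, `toy_plan`, `toy_env`, and `CANDIDATES` are hypothetical names, the planner is a rule-based stand-in for an LLM, retrieval is reduced to an exact goal match, and the database is assumed to accumulate records across iterations. The toy task mirrors the scenario in the Figure 4 caption, where the task only succeeds once the sink is toggled.

```python
# Hypothetical candidate plans a stand-in "LLM" can choose from.
CANDIDATES = {
    "water the pot plants": [
        ["pickup pot_plant", "drop pot_plant in sink"],
        ["pickup pot_plant", "drop pot_plant in sink", "toggle sink"],
    ],
}


def toy_plan(goal, references):
    """Stand-in for LLM planning: reuse a success, else avoid known failures."""
    for _, actions, done in references:
        if done:
            return actions
    failed = [actions for _, actions, done in references if not done]
    for candidate in CANDIDATES[goal]:
        if candidate not in failed:
            return candidate
    return CANDIDATES[goal][-1]


def toy_env(goal, actions):
    """Stand-in environment: the task counts as done only if the sink is toggled."""
    return "toggle sink" in actions


def run_progressive(tasks, plan_fn, env_fn, iterations=3):
    database = []  # records of (goal, trajectory, done); no ground truth stored
    success_rates = []
    for _ in range(iterations):
        new_records = []
        successes = 0
        for goal in tasks:
            # Placeholder retrieval: exact goal match instead of similarity search.
            references = [r for r in database if r[0] == goal]
            actions = plan_fn(goal, references)
            done = env_fn(goal, actions)
            new_records.append((goal, actions, done))
            successes += done
        # Assumption: the database is extended once the whole iteration finishes.
        database.extend(new_records)
        success_rates.append(successes / len(tasks))
    return database, success_rates


database, rates = run_progressive(["water the pot plants"], toy_plan, toy_env)
print(rates)
```

In this toy setting the first iteration fails because no experience exists yet, and later iterations succeed by consulting the stored failure and then the stored success, which is the kind of self-improvement without ground truth that the abstract describes.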

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: The framework of our Progressive Retrieval Augmented Generation method. a) The database consists of a list of tuples, each including the goal instruction, scene graph, trajectory history, and whether the task was completed. b) The LLM-based agent. c) The interactive environment (MINI-BEHAVIOR or ALFRED). The database is updated after each complete interaction between the agent and the environment, equipping the agent with increasingly high-quality experiences.
  • Figure 2: The pipeline of P-RAG in each iteration. "#" denotes information in text form. a) The information transmitted to the agent consists of four parts: the natural language goal instruction, observations obtained from the environment, the action space of the agent, and the retrieval results from the database. b) The agent uses an LLM to plan a series of actions according to the information in (a). If the LLM produces unsatisfactory content, the agent initiates a reattempt; otherwise, it applies a filtering mechanism to extract the requisite actions from the fields. c) The environment receives actions from the agent and returns observations, along with a "done" state denoting whether the task is completed. d) After each iteration, which comprises multiple tasks, the database is updated. During each update, it stores the embedding vectors of the goal instruction and of the scene graph obtained through observation. e) The database contains the trajectories of previous iterations. f) The interface between the database and the agent involves two steps. First, the agent's current goal instruction and observation are embedded into vectors, which serve as the query in the retrieval-augmented process. Second, the similarity between the query and each database item is computed, and the top K relevant database items are returned to the agent.
  • Figure 3: Database Construction and Retrieval. In P-RAG, both database construction and retrieval rely on sentence encoding. 1) During insertion, four components are provided: goal instruction, scene graph, history, and done. Of these, the goal instruction and scene graph are passed through sentence embedding and stored as vectors in the database. 2) At retrieval time, the current task's goal instruction and scene graph serve as queries. They are likewise encoded into sentence embeddings, and similarity scores are computed between them and the corresponding vectors in the database. The top K historical trajectories are then returned based on the aggregated similarity scores.
  • Figure 4: Comparison of planning trajectories between the GPT-4 baseline and P-RAG. The baseline sequentially picks up three pot plants, places them in the sink, and then considers the task complete, but the task is not actually accomplished. In contrast, P-RAG draws on comprehensive historical trajectory information, decides to toggle the sink, and ultimately accomplishes the task.
  • Figure 5: Performance with Iteration Number on ALFRED.
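
The database construction and retrieval steps described in the captions of Figures 1 to 3 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's code: `toy_embed` is a hashed bag-of-words stand-in for a real sentence encoder such as MiniLM, the scene graph is represented as plain text, the aggregated score is assumed to be the sum of the goal and scene-graph cosine similarities, and the names `Experience` and `ExperienceDB` are hypothetical.

```python
import math
import zlib
from dataclasses import dataclass
from typing import List


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hashed bag-of-words vector; a stand-in for a sentence encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Experience:
    goal: str
    scene_graph: str
    history: List[str]
    done: bool
    goal_vec: List[float]
    scene_vec: List[float]


class ExperienceDB:
    def __init__(self) -> None:
        self.items: List[Experience] = []

    def insert(self, goal: str, scene_graph: str, history: List[str], done: bool) -> None:
        # Only the goal instruction and scene graph are embedded on insertion.
        self.items.append(
            Experience(goal, scene_graph, history, done,
                       toy_embed(goal), toy_embed(scene_graph))
        )

    def retrieve(self, goal: str, scene_graph: str, k: int = 2):
        g, s = toy_embed(goal), toy_embed(scene_graph)
        # Assumption: aggregate by summing the two cosine similarities.
        scored = [(cosine(g, e.goal_vec) + cosine(s, e.scene_vec), e)
                  for e in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


db = ExperienceDB()
db.insert("wash the dishes", "plate in sink, sink in kitchen",
          ["pickup plate", "toggle sink"], True)
db.insert("water the pot plants", "pot_plant on table, sink in kitchen",
          ["pickup pot_plant", "drop pot_plant in sink"], False)
db.insert("open the window", "window in bedroom",
          ["goto window", "open window"], True)

for score, exp in db.retrieve("water the pot plants",
                              "pot_plant on table, sink in kitchen", k=2):
    print(round(score, 3), exp.goal, exp.done, exp.history)
```

Scoring on both the goal and the scene graph means a stored experience can rank highly because the situation is similar even when the task name differs, which is the intent of the more granular retrieval scheme described in the abstract. Swapping `toy_embed` for a real sentence encoder would leave the rest of the sketch unchanged.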