Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration

Yang Zhang, Shixin Yang, Chenjia Bai, Fei Wu, Xiu Li, Zhen Wang, Xuelong Li

TL;DR

This work tackles the inefficiency of grounding LLM-driven planning in embodied multi-agent tasks by introducing Reinforced Advantage Feedback (ReAd). ReAd uses a critic to learn joint and local advantage functions from LLM-planned data, then treats the LLM planner as an optimizer to maximize these advantages, yielding two refinement modes: ReAd-S (sequential) and ReAd-J (joint). The approach is theoretically grounded via advantage-weighted regression extended to multi-agent settings and empirically validated on DV-RoCoBench and Overcooked-AI, showing higher success rates and substantially fewer environment interactions and LLM queries. These results demonstrate that advantage-based feedback can effectively ground LLMs for coordinated embodied tasks at a higher efficiency than physical verification-based methods.

Abstract

Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. In particular, LLM planning for multi-agent collaboration requires communication among agents or credit assignment as feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that rely heavily on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. This endows the LLM with the foresight to discern whether an action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate and significantly decreases both the interaction steps of agents and the query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://embodied-read.github.io

Paper Structure

This paper contains 51 sections, 3 theorems, 40 equations, 19 figures, 10 tables, 3 algorithms.

Key Result

Lemma 1

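(Multi-Agent Advantage Decomposition). In any cooperative Markov game, given a joint policy $\boldsymbol{\pi}$ and the whole set of agents $\mathcal{N}=\{1, \ldots, n\}$, for any state $s$ and any ordered set $i_{1:n}$ of all agents, we have

$$A_{\boldsymbol{\pi}}(s, \boldsymbol{a}) = \sum_{k=1}^{n} A_{\boldsymbol{\pi}}^{i_k}\left(s, \boldsymbol{a}^{i_{1:k-1}}, a^{i_k}\right),$$

where $\boldsymbol{a} = (a^1, a^2, \ldots, a^n)$.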
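As a sanity check on the lemma, the following minimal sketch verifies numerically that the joint advantage equals the sum of local advantages accumulated along an agent ordering. It is not the paper's code. It assumes a single state, a randomly generated joint Q-table, independent per-agent policies, and the standard definition of the local advantage as the difference between two marginalized Q-values; the names `marginal_q`, `local_advantages`, and `random_policy` are hypothetical.

```python
import itertools
import random


def marginal_q(q, policies, fixed):
    """Expected Q when agents in `fixed` (agent -> action) act as given
    and every other agent samples from its own policy (assumed independent)."""
    n = len(policies)
    free = [i for i in range(n) if i not in fixed]
    total = 0.0
    for choice in itertools.product(*(range(len(policies[i])) for i in free)):
        joint = dict(fixed)
        joint.update(zip(free, choice))
        prob = 1.0
        for i, a in zip(free, choice):
            prob *= policies[i][a]
        total += prob * q[tuple(joint[i] for i in range(n))]
    return total


def local_advantages(q, policies, order, joint_action):
    """Local advantage of each agent in `order`, conditioned on the
    actions already committed by the agents earlier in the ordering."""
    fixed, advantages = {}, []
    previous = marginal_q(q, policies, fixed)  # equals V(s)
    for agent in order:
        fixed[agent] = joint_action[agent]
        current = marginal_q(q, policies, fixed)
        advantages.append(current - previous)
        previous = current
    return advantages


def random_policy(rng, n_actions):
    """A strictly positive random distribution over actions."""
    weights = [rng.random() + 0.1 for _ in range(n_actions)]
    norm = sum(weights)
    return [w / norm for w in weights]


rng = random.Random(0)
n_agents, n_actions = 3, 2
# Hypothetical joint Q-values for one state (illustrative data only).
q_table = {a: rng.uniform(-1.0, 1.0)
           for a in itertools.product(range(n_actions), repeat=n_agents)}
policies = [random_policy(rng, n_actions) for _ in range(n_agents)]

value = marginal_q(q_table, policies, {})
max_gap = 0.0
for order in itertools.permutations(range(n_agents)):
    for action in q_table:
        joint_advantage = q_table[action] - value
        summed = sum(local_advantages(q_table, policies, order, action))
        max_gap = max(max_gap, abs(joint_advantage - summed))

print(f"largest decomposition gap: {max_gap:.2e}")
```

The sum telescopes, so the gap should sit at floating-point noise for every ordering of the agents. This ordering-independence is what allows credit to be assigned agent by agent.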

Figures (19)

  • Figure 1: An illustration of the negotiation process of RoCo and our method. RoCo interacts with the environment for each plan and takes the environment's feedback as prompts. In contrast, our method takes the advantage function (Adv.) evaluated by a critic as feedback, and revises the plan if the advantage value is lower than the threshold, which significantly reduces the interaction rounds to the environment.
  • Figure 2: An overview of prompting and refinement. At each timestep $t$, the LLM planner is given the history, which contains states, actions, and advantages, and is prompted to generate a plan with the highest advantage. The pre-trained critic evaluates the score of the generated action, $\mathbb{S}_{\rm ReAd}(a_t^i)$. If $\mathbb{S}_{\rm ReAd}(a_t^i)<\epsilon$, the failed plan is used as a prompt, and the LLM planner is asked to refine the policy until $\mathbb{S}_{\rm ReAd}(a_t^i) > \epsilon$. The (refined) action is then used to interact with the environment, and the LLM planner proceeds to the next step.
  • Figure 3: We report mean SR ($\boldsymbol{\uparrow}$), ES ($\boldsymbol{\downarrow}$), and NQ ($\boldsymbol{\downarrow}$) on 3 tasks with various difficulty levels, averaged over 10 random seeds. Detailed scores are given in the main-result table of the appendix.
  • Figure 4: The initial states of the 5 difficulty levels in modified Sweep Floor. The yellow and green squares are the ones to be swept in this task. The first three tasks have a total of 7 squares, while the last two have 9. We assess task difficulty based on the number of cubes to be swept and the total number of cubes. For example, Y1_G1 in the figure means that 1 yellow cube and 1 green cube need to be swept.
  • Figure 5: The initial states of the 4 difficulty levels in modified Make Sandwich. The first three tasks share the same food and layout, differing only in the length of the recipe. The final task has different food and a different layout, along with a longer recipe. The recipe lengths for the four tasks are 3, 5, 7, and 9, respectively.
  • ...and 14 more figures
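
The refinement loop described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the planner and critic are passed in as plain callables, the names `refine_plan`, `toy_planner`, and `toy_critic` are hypothetical, the `max_rounds` cap is an added safeguard not mentioned in the captions, and the toy scores are made-up values.

```python
from typing import Callable, List, Tuple

# Each feedback entry pairs a rejected plan with its critic score.
Feedback = List[Tuple[str, float]]


def refine_plan(
    propose: Callable[[str, Feedback], str],
    score: Callable[[str, str], float],
    state: str,
    epsilon: float,
    max_rounds: int = 5,  # assumption: a cap so the loop always terminates
) -> Tuple[str, int]:
    """Query the planner until the critic score exceeds `epsilon`.

    Returns the accepted (or last proposed) plan and the number of
    planner queries spent. No environment step happens inside the loop.
    """
    feedback: Feedback = []
    plan = propose(state, feedback)
    queries = 1
    current = score(state, plan)
    while current <= epsilon and queries < max_rounds:
        # The failed plan and its score become part of the next prompt.
        feedback.append((plan, current))
        plan = propose(state, feedback)
        queries += 1
        current = score(state, plan)
    return plan, queries


# Toy stand-ins for the LLM planner and the learned critic (illustrative only).
CANDIDATES = ["wait", "move to cube", "sweep cube"]
TOY_SCORES = {"wait": -0.5, "move to cube": 0.1, "sweep cube": 0.8}


def toy_planner(state: str, feedback: Feedback) -> str:
    """Return the first candidate that has not been rejected yet."""
    rejected = {plan for plan, _ in feedback}
    for candidate in CANDIDATES:
        if candidate not in rejected:
            return candidate
    return CANDIDATES[-1]


def toy_critic(state: str, plan: str) -> float:
    """Look up a fixed, made-up advantage score for the plan."""
    return TOY_SCORES[plan]


plan, queries = refine_plan(toy_planner, toy_critic, "cube on table", epsilon=0.5)
print(plan, queries)
```

Only the plan returned by `refine_plan` would be executed in the environment, which is where the saving in interaction rounds relative to verification through environment feedback comes from.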

Theorems & Definitions (7)

  • Definition 1
  • Lemma 1
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof