Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment

Kartik Nagpal, Dayi Dong, Jean-Baptiste Bouvier, Negar Mehr

TL;DR

This work tackles the credit assignment problem in centralized-training decentralized-execution (CTDE) multi-agent reinforcement learning by reframing it as a pattern-recognition task. It introduces two Large Language Model (LLM)–based centralized critics, LLM-MCA and its extension LLM-TACA, which provide per-agent credit signals (and explanations) and, in the case of LLM-TACA, explicit task assignments to agents during training. The methods outperform state-of-the-art baselines on diverse benchmarks, including Level-Based Foraging, Robotic Warehouse, and the new Spaceworld environment that emphasizes safety and collision avoidance, while also producing large offline datasets of trajectories with per-agent reward annotations. The findings highlight the potential of LLM-driven critics for explainable, dense credit feedback and suggest future directions toward real-time, low-cost evaluations and broader non-cooperative settings.

Abstract

Recent work, spanning from autonomous vehicle coordination to in-space assembly, has shown the importance of learning collaborative behavior for enabling robots to achieve shared goals. A common approach for learning this cooperative behavior is to utilize the centralized-training decentralized-execution paradigm. However, this approach also introduces a new challenge: how do we evaluate the contributions of each agent's actions to the overall success or failure of the team? This credit assignment problem has remained open, and has been extensively studied in the Multi-Agent Reinforcement Learning literature. In fact, humans manually inspecting agent behavior often generate better credit evaluations than existing methods. We combine this observation with recent work showing that Large Language Models demonstrate human-level performance at many pattern recognition tasks. Our key idea is to reformulate credit assignment as the two pattern recognition problems of sequence improvement and attribution, which motivates our novel LLM-MCA method. Our approach utilizes a centralized LLM reward-critic which numerically decomposes the environment reward based on the individualized contribution of each agent in the scenario. We then update the agents' policy networks based on this feedback. We also propose an extension, LLM-TACA, where our LLM critic performs explicit task assignment by passing an intermediary goal directly to each agent policy in the scenario. Both our methods far outperform the state-of-the-art on a variety of benchmarks, including Level-Based Foraging, Robotic Warehouse, and our new Spaceworld benchmark which incorporates collision-related safety constraints. As an artifact of our methods, we generate large trajectory datasets with each timestep annotated with per-agent reward information, as sampled from our LLM critics.
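
To make the reward-decomposition idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the LLM critic is replaced by a stub function, and the JSON message format, the `decompose_reward` name, and the step that shifts the credits so they sum to the global reward are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List


def decompose_reward(
    critic: Callable[[str], str],
    observations: Dict[str, List[float]],
    global_reward: float,
    agent_ids: List[str],
) -> Dict[str, float]:
    """Query a critic for per-agent credit, then shift the credits to sum to the global reward."""
    # Hypothetical message format: the source does not specify one.
    message = json.dumps({"global_reward": global_reward, "observations": observations})
    reply = json.loads(critic(message))
    # Agents the critic did not mention receive zero raw credit.
    raw = [float(reply.get(agent, 0.0)) for agent in agent_ids]
    # Assumption: spread any mismatch evenly so the credits sum to the global reward.
    correction = (global_reward - sum(raw)) / len(agent_ids)
    return {agent: credit + correction for agent, credit in zip(agent_ids, raw)}


def stub_critic(message: str) -> str:
    """Stand-in for an LLM call; always returns the same fixed credits."""
    return json.dumps({"agent_0": 3.0, "agent_1": 1.0})


credits = decompose_reward(
    stub_critic,
    {"agent_0": [0.0, 1.0], "agent_1": [2.0, 3.0]},
    global_reward=8.0,
    agent_ids=["agent_0", "agent_1"],
)
print(credits)
```

In a training loop, each agent's policy update would consume its own entry of `credits` in place of the shared global reward. Swapping `stub_critic` for a real model call would also require validating the reply, since a language model may return malformed or incomplete output.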

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overall architecture diagram for our LLM-MCA and LLM-TACA methods. Our centralized training architecture utilizes a centralized LLM-critic instantiated with our base prompt (environment description, our definitions, and task query). At each timestep, we update our LLM-critic with the global reward and latest observations from the environment. We then update our agents' policies with the individualized feedback from our critic.
  • Figure 2: Diagram for our batch-training process with our LLM-MCA method. Our centralized training process allows us to provide entire batches of trajectories at once to our centralized LLM-critic. Our LLM-MCA critic then generates individualized feedback for each agent, which we use to update their policies. After training, we no longer need our LLM-critic, and directly deploy our trained, decentralized agent policies.
  • Figure 3: Example prompt for LLM-MCA in the "Spaceworld" benchmark. Our base prompt $p_\text{base} := (p_\text{env}, p_\text{desc}, p_\text{defn}, p_\text{task})$ is divided into (1) a description of the scenario's rules and objectives, (2) a description of the kinds of inputs it will now receive from that environment, (3) our agreement problem definitions with examples, and (4) a description of its role as a credit assignment agent along with the formatting requirements of its output.
  • Figure 4: Example output from our LLM-TACA method in the level-based foraging benchmark. Our LLM-critic provides individualized credit assignments for the previous timesteps, task assignments, and explanations for its decisions.
  • Figure 5: Comparison of our methods with baselines on benchmarks. All values reported are averaged across five separate training runs and the 95% confidence interval is illustrated by the error bars. Spaceworld is our custom benchmark where agents must avoid collision while transporting their respective parts to their destinations.
  • ...and 1 more figures
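
The four-part base prompt from the Figure 3 caption, $p_\text{base} := (p_\text{env}, p_\text{desc}, p_\text{defn}, p_\text{task})$, can be sketched as a small data structure. This is only an illustration: the `BasePrompt` class, its field names, the blank-line separator, and the placeholder strings are hypothetical and are not taken from the paper's actual prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasePrompt:
    """Four-part base prompt mirroring (p_env, p_desc, p_defn, p_task)."""

    env: str   # scenario rules and objectives
    desc: str  # kinds of inputs the critic will receive
    defn: str  # problem definitions with examples
    task: str  # critic's role and output formatting requirements

    def render(self) -> str:
        """Join the four parts in order, separated by blank lines (an assumed layout)."""
        return "\n\n".join((self.env, self.desc, self.defn, self.task))


# Placeholder text only; the real prompt contents are not reproduced here.
base = BasePrompt(
    env="Scenario rules and objectives go here.",
    desc="Description of per-timestep inputs goes here.",
    defn="Problem definitions with examples go here.",
    task="Role description and output format go here.",
)
prompt_text = base.render()
print(prompt_text)
```

Keeping the parts as separate fields makes it possible to swap only the environment description when moving between benchmarks while reusing the definitions and task query.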