Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment
Kartik Nagpal, Dayi Dong, Jean-Baptiste Bouvier, Negar Mehr
TL;DR
This work tackles the credit assignment problem in centralized-training decentralized-execution (CTDE) multi-agent reinforcement learning by reframing it as a pattern-recognition task. It introduces two Large Language Model (LLM)-based centralized critics, LLM-MCA and its extension LLM-TACA. Both provide per-agent credit signals with explanations, and LLM-TACA additionally passes explicit task assignments to agents during training. The methods outperform state-of-the-art baselines on Level-Based Foraging, Robotic Warehouse, and the new Spaceworld environment, which emphasizes safety and collision avoidance. As a by-product, they produce an offline dataset of trajectories with per-agent annotations. The findings highlight the potential of LLM-driven critics for explainable, dense credit feedback and suggest future directions toward real-time, low-cost evaluations and broader non-cooperative settings.
Abstract
Recent work, spanning from autonomous vehicle coordination to in-space assembly, has shown the importance of learning collaborative behavior for enabling robots to achieve shared goals. A common approach for learning this cooperative behavior is to utilize the centralized-training decentralized-execution paradigm. However, this approach also introduces a new challenge: how do we evaluate the contribution of each agent's actions to the overall success or failure of the team? This credit assignment problem has been extensively studied in the Multi-Agent Reinforcement Learning literature, yet it remains open. In fact, humans manually inspecting agent behavior often generate better credit evaluations than existing methods. We combine this observation with recent work showing that Large Language Models demonstrate human-level performance at many pattern recognition tasks. Our key idea is to reformulate credit assignment as the two pattern recognition problems of sequence improvement and attribution, which motivates our novel LLM-MCA method. Our approach utilizes a centralized LLM reward-critic which numerically decomposes the environment reward based on the individualized contribution of each agent in the scenario. We then update the agents' policy networks based on this feedback. We also propose an extension, LLM-TACA, where our LLM critic performs explicit task assignment by passing an intermediary goal directly to each agent policy in the scenario. Both our methods far outperform the state-of-the-art on a variety of benchmarks, including Level-Based Foraging, Robotic Warehouse, and our new Spaceworld benchmark, which incorporates collision-related safety constraints. As an artifact of our methods, we generate large trajectory datasets in which each timestep is annotated with per-agent reward information, as sampled from our LLM critics.
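To make the idea of numerically decomposing a team reward concrete, the sketch below splits a single environment reward across agents in proportion to per-agent contribution scores. It is only an illustration under stated assumptions, not the authors' implementation: the function name `decompose_reward`, the non-negative scores, and the requirement that the per-agent credits sum to the team reward are all hypothetical choices, and in LLM-MCA the scores would come from the LLM critic rather than being supplied by hand.

```python
from typing import Dict


def decompose_reward(team_reward: float, scores: Dict[str, float]) -> Dict[str, float]:
    """Split team_reward across agents in proportion to their scores.

    Assumption (not from the paper): scores are non-negative and the
    per-agent credits are required to sum to the team reward.
    """
    if not scores:
        raise ValueError("at least one agent is required")
    if any(s < 0 for s in scores.values()):
        raise ValueError("contribution scores must be non-negative")

    total = sum(scores.values())
    if total == 0:
        # No agent was judged to have contributed: fall back to an even split.
        share = team_reward / len(scores)
        return {agent: share for agent in scores}

    # Proportional split: each agent receives its fraction of the total score.
    return {agent: team_reward * s / total for agent, s in scores.items()}


# Hypothetical scores standing in for the critic's judgement of one timestep.
credits = decompose_reward(10.0, {"agent_0": 3.0, "agent_1": 1.0})
print(credits)
```

Each agent's policy update would then use its own entry of `credits` in place of the shared team reward; how the critic actually produces and formats its scores is outside the scope of this sketch.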
