Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning

Muhan Lin, Shuyang Shi, Yue Guo, Vaishnav Tadiparthi, Behdad Chalaki, Ehsan Moradi Pari, Simon Stepputtis, Woojun Kim, Joseph Campbell, Katia Sycara

TL;DR

Credit assignment in multi-agent reinforcement learning (MARL) is challenging under sparse rewards. The paper proposes LLM-guided Credit Assignment (LCA), which uses an LLM to generate dense, agent-specific rewards: the LLM ranks pairs of states from each agent's perspective, and a per-agent potential function learned from those rankings is used to shape rewards. This potential-based decomposition mitigates ranking uncertainty and supports centralized training with decentralized execution (CTDE), improving convergence and policy performance. Experiments in grid-world and Pistonball environments show faster learning, higher returns, and robustness to ranking errors, including with smaller LLMs, suggesting practical applicability to MARL settings with sparse feedback.

Abstract

Credit assignment, the process of attributing credit or blame to individual agents for their contributions to a team's success or failure, remains a fundamental challenge in multi-agent reinforcement learning (MARL), particularly in environments with sparse rewards. Commonly-used approaches such as value decomposition often lead to suboptimal policies in these settings, and designing dense reward functions that align with human intuition can be complex and labor-intensive. In this work, we propose a novel framework where a large language model (LLM) generates dense, agent-specific rewards based on a natural language description of the task and the overall team goal. By learning a potential-based reward function over multiple queries, our method reduces the impact of ranking errors while allowing the LLM to evaluate each agent's contribution to the overall task. Through extensive experiments, we demonstrate that our approach achieves faster convergence and higher policy returns compared to state-of-the-art MARL baselines.
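
The idea of learning a potential from repeated, possibly inconsistent rankings can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a linear potential over a hypothetical one-dimensional "progress" feature, a Bradley-Terry (logistic) loss over pairwise rankings, and the standard potential-based shaping form F(s, s') = γΦ(s') − Φ(s). All function names and the toy data are illustrative assumptions.

```python
import math

def potential(w, s):
    # Hypothetical linear potential phi(s) = w . s for one agent.
    return sum(wi * si for wi, si in zip(w, s))

def fit_potential(pairs, dim, lr=0.5, epochs=200):
    """Fit w from pairwise rankings with a logistic (Bradley-Terry) loss.

    pairs: list of (s_a, s_b, label) where label is 1.0 if s_b was ranked
    better, 0.0 if s_a was ranked better, and 0.5 if they were ranked equal.
    Repeated queries for the same pair appear as repeated entries, so a
    single wrong answer is outvoted rather than trusted.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for s_a, s_b, label in pairs:
            diff = potential(w, s_b) - potential(w, s_a)
            p = 1.0 / (1.0 + math.exp(-diff))  # modelled P(s_b better than s_a)
            for i in range(dim):
                grad[i] += (p - label) * (s_b[i] - s_a[i])
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

def shaped_reward(w, s, s_next, gamma=0.99):
    # Potential-based shaping term: gamma * phi(s') - phi(s).
    return gamma * potential(w, s_next) - potential(w, s)

# Toy data (assumed): the single feature is an agent's progress toward its subgoal.
# Three queries say [0.8] beats [0.2]; one erroneous query says the opposite.
pairs = (
    [([0.2], [0.8], 1.0)] * 3
    + [([0.2], [0.8], 0.0)]
    + [([0.9], [0.1], 0.0)] * 2
)
w = fit_potential(pairs, dim=1)
forward = shaped_reward(w, [0.2], [0.8])   # agent made progress
backward = shaped_reward(w, [0.8], [0.2])  # agent lost progress
```

In this toy setup the majority of rankings still determines the sign of the learned weight, so moving toward the subgoal receives a positive shaped reward despite the one contradictory ranking. In a multi-agent setting, one such potential would be fitted per agent from that agent's own rankings.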

Paper Structure

This paper contains 16 sections, 8 equations, 6 figures.

Figures (6)

  • Figure 1: Overview of our method LCA: We first generate agent-specific encodings of state observations, and then prompt an LLM to perform pairwise state ranking from each agent's perspective in the context of collaboration. Specifically, when ranking state pairs from Agent 1's perspective, Agent 1 is encoded as the "ego" agent and the other agents as "teammates" in the observation, allowing the LLM to differentiate them through the language-based observation description. The individual rewards trained on these agent-specific ranking results handle credit assignment in MARL. We test our approach in the grid world and pistonball environments.
  • Figure 2: Evaluation environments: the Two-Switch (left) and Victim-Rubble (middle) grid worlds, and Pistonball (right).
  • Figure 3: Average learning curves over 3 random seeds in the Two-Switch, Victim-Rubble and Pistonball environments, with reward functions trained from a single LLM ranking per state pair. Shaded areas show the return variance. The training returns on the y-axis are measured as vanilla individual rewards plus team rewards.
  • Figure 4: The learning curves with reward functions trained from four-query synthetic experiments over 3 random seeds.
  • Figure 5: The learning curves with reward functions trained from two-query rankings by Llama3.1-70B:q3 over 3 random seeds.
  • ...and 1 more figure
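
The agent-specific encoding described in the Figure 1 caption can be sketched as follows. This is a hypothetical illustration, not the paper's prompt format: the observation layout (a mapping from agent id to grid position), the function names, and the prompt wording are all assumptions. The only element taken from the caption is that the agent whose perspective is used is labelled "ego" and the others "teammate".

```python
def encode_for_agent(agent_positions, ego_id):
    # Describe the same state from one agent's perspective: that agent is
    # labelled "ego", every other agent is labelled "teammate".
    parts = []
    for agent_id, pos in sorted(agent_positions.items()):
        role = "ego" if agent_id == ego_id else "teammate"
        parts.append(f"{role} agent at {pos}")
    return "; ".join(parts)

def build_ranking_prompt(task, state_a, state_b, ego_id):
    # Assemble a pairwise ranking query (hypothetical wording) for an LLM.
    return (
        f"Task: {task}\n"
        f"State A: {encode_for_agent(state_a, ego_id)}\n"
        f"State B: {encode_for_agent(state_b, ego_id)}\n"
        "From the ego agent's perspective, which state is better for the "
        "team's goal? Answer A, B, or equal."
    )

# The same pair of states yields a different description per agent.
state_a = {1: (0, 0), 2: (3, 4)}
state_b = {1: (1, 0), 2: (3, 4)}
prompt_agent_1 = build_ranking_prompt("Open both switches.", state_a, state_b, ego_id=1)
prompt_agent_2 = build_ranking_prompt("Open both switches.", state_a, state_b, ego_id=2)
```

Because only the role labels change between the two prompts, any difference in the LLM's rankings can be attributed to which agent is treated as the ego agent, which is what makes the resulting rewards agent-specific.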