Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization

Aditya Kapoor, Benjamin Freed, Howie Choset, Jeff Schneider

TL;DR

MAPPO's credit assignment challenge grows with team size, hindering data efficiency. The authors introduce PRD-MAPPO, merging Partial Reward Decoupling with MAPPO via an attention-based critic to identify agent-relevant sets, enabling linear-time advantage estimation and a soft, scalable update scheme, including a shared-reward variant. Across MARL benchmarks including StarCraft II, PRD-MAPPO variants outperform MAPPO and several baselines, with PRD-MAPPO-soft often delivering the strongest results; gradient-variance analyses and attention visualizations support improved credit assignment and learning stability. This work broadens MARL scalability to larger teams and to shared-reward settings, offering a practical approach to cooperative multi-agent learning with reduced data requirements.

Abstract

Multi-agent proximal policy optimization (MAPPO) has recently demonstrated state-of-the-art performance on challenging multi-agent reinforcement learning tasks. However, MAPPO still struggles with the credit assignment problem, wherein the sheer difficulty in ascribing credit to individual agents' actions scales poorly with team size. In this paper, we propose a multi-agent reinforcement learning algorithm that adapts recent developments in credit assignment to improve upon MAPPO. Our approach leverages partial reward decoupling (PRD), which uses a learned attention mechanism to estimate which of a particular agent's teammates are relevant to its learning updates. We use this estimate to dynamically decompose large groups of agents into smaller, more manageable subgroups. We empirically demonstrate that our approach, PRD-MAPPO, decouples agents from teammates that do not influence their expected future reward, thereby streamlining credit assignment. We additionally show that PRD-MAPPO yields significantly higher data efficiency and asymptotic performance compared to both MAPPO and other state-of-the-art methods across several multi-agent tasks, including StarCraft II. Finally, we propose a version of PRD-MAPPO that is applicable to shared reward settings, where PRD was previously not applicable, and empirically show that this also leads to performance improvements over MAPPO.
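To make the idea of decomposing a team into relevant subgroups concrete, the following is a minimal sketch, not the authors' implementation. It assumes a matrix `attn` in which `attn[i][j]` is the attention weight agent i assigns to agent j, a hypothetical cutoff `threshold`, and the convention that an agent is always relevant to itself. The function names, the threshold value, and the way the hard and soft variants are expressed are illustrative assumptions.

```python
from typing import List


def credit_weights(
    attn: List[List[float]], threshold: float, soft: bool = False
) -> List[List[float]]:
    """Turn attention weights into per-pair credit weights.

    Hard mode (assumed): weight is 1.0 if attn[i][j] exceeds the
    threshold, else 0.0. Soft mode (assumed): the raw attention
    weight is used directly. An agent always gets weight 1.0 for itself.
    """
    n = len(attn)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(1.0)
            elif soft:
                row.append(attn[i][j])
            else:
                row.append(1.0 if attn[i][j] > threshold else 0.0)
        weights.append(row)
    return weights


def relevant_sets(attn: List[List[float]], threshold: float) -> List[List[int]]:
    """For each agent i, list the agents whose hard credit weight is nonzero."""
    hard = credit_weights(attn, threshold, soft=False)
    return [[j for j, w in enumerate(row) if w > 0.0] for row in hard]


if __name__ == "__main__":
    # Hypothetical attention weights for three agents (diagonal is unused).
    attn = [
        [0.0, 0.7, 0.05],
        [0.6, 0.0, 0.1],
        [0.02, 0.03, 0.0],
    ]
    print(relevant_sets(attn, threshold=0.2))
```

With these illustrative numbers, agents 0 and 1 end up in each other's subgroup while agent 2 is decoupled from both, which is the kind of dynamic grouping the abstract describes.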

Paper Structure

This paper contains 22 sections, 11 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Q and Value Function Network Architecture. Each agent uses states from all agents to compute attention weights for every agent other than itself. These attention weights are then used to aggregate attention values from all agents other than itself. The aggregated attention values for agent $i$ are then concatenated with either the embedded state-action vector for agent $i$ (if the network is functioning as a Q function) or the embedded state vector for agent $i$ (if the network is functioning as a value function). Finally, the concatenated vector is passed through the output network to generate either $Q^{\phi}_i(s,a)$ or $V^{\psi}_i(s,a^{\neq i})$.
  • Figure 2: Average reward vs. episode for PRD-MAPPO-soft, PRD-MAPPO, PRD-V-MAPPO, COMA, LICA, QMix, MAPPO, and MAPPO-G2ANet on the A) team collision avoidance, B) pursuit, C) pressure plate, D) Level-Based Foraging, E) StarCraft 5m_vs_6m, F) StarCraft 10m_vs_11m, and G) StarCraft 3s5z tasks. Solid lines indicate the average over 5 random seeds, and shaded regions denote a 95% confidence interval. Approaches that incorporate PRD (PRD-MAPPO and PRD-MAPPO-soft) tend to outperform all other approaches, indicating that PRD can be leveraged to improve PPO by improving credit assignment.
  • Figure 3: Relevant set visualization in the Collision Avoidance environment. We visualize the attention weight that each agent assigns to every other agent, averaged across 5000 independent episodes. Because agents always assign an attention weight of 1 to themselves, we remove those elements from the plot as they are uninformative. Agents generally assign far higher attention weights to agents on their own team than to agents on other teams, which is expected because only an agent's teammates can influence its rewards.
  • Figure 4: Gradient estimator variance vs. episode for the team collision avoidance, pressure plate, and Level-Based Foraging (LBF) environments. Solid lines indicate the average over 5 random seeds, and shaded regions denote a 95% confidence interval. PRD-MAPPO tends to avoid the dramatic spikes in gradient variance exhibited by MAPPO.
  • Figure 5: Average reward vs. episode for PRD-MAPPO-soft, PRD-MAPPO-shared, PRD-MAPPO-ascend, PRD-MAPPO-decay, PRD-MAPPO, PRD-MAPPO-top-K, and PRD-MAPPO-G2ANet on the A) team collision avoidance, B) pursuit, C) pressure plate, D) Level-Based Foraging, E) StarCraft 5 marines vs. 6 marines, F) StarCraft 10 marines vs. 11 marines, and G) StarCraft 3 Stalkers and 5 Zealots tasks. Solid lines indicate the average over 5 random seeds, and shaded regions denote +/- 1 standard deviation. PRD-MAPPO-soft tended to perform the best across all tasks.
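
The data flow described in the Figure 1 caption can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the paper's network: scaled dot-product scoring and a softmax over the other agents are assumed, the per-agent `queries`, `keys`, `values`, and `own_embedding` vectors are taken as already computed by hypothetical embedding layers, and the final output network is omitted.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def critic_input(
    i: int,
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    own_embedding: Vector,
) -> Tuple[Vector, Dict[int, float]]:
    """Build the vector fed to agent i's output network (assumed layout).

    Attention is computed only over agents j != i, the resulting weights
    aggregate the other agents' value vectors, and the aggregate is
    concatenated with agent i's own embedding (state-action embedding for
    a Q function, state embedding for a value function).
    """
    others = [j for j in range(len(keys)) if j != i]
    scale = math.sqrt(len(queries[i]))
    scores = [
        sum(q * k for q, k in zip(queries[i], keys[j])) / scale for j in others
    ]
    weights = softmax(scores)
    dim = len(values[0])
    aggregated = [
        sum(w * values[j][d] for w, j in zip(weights, others)) for d in range(dim)
    ]
    return aggregated + own_embedding, dict(zip(others, weights))
```

The returned weight dictionary never contains agent i itself, which mirrors the caption's statement that each agent attends only to agents other than itself.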