Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, Yaodong Yang

TL;DR

The paper extends trust-region policy optimization to multi-agent settings by proving a multi-agent advantage decomposition, enabling a sequential per-agent update scheme with monotonic joint-policy improvement. It introduces HATRPO and HAPPO, which allow heterogeneous agents and do not require joint value function decomposability, yet guarantee improvement and convergence to Nash equilibria. Empirically, these methods achieve state-of-the-art results on both StarCraft II and Multi-Agent MuJoCo benchmarks, outperforming strong baselines while avoiding parameter sharing. This work advances practical and theoretically grounded MARL by delivering scalable, monotonic-trust-region algorithms applicable to heterogeneous agent teams.

Abstract

Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, could have conflicting directions of policy updates. As a result, achieving a guaranteed improvement on the joint policy where each agent acts individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO do not need agents to share parameters, nor do they need any restrictive assumptions on decomposability of the joint value function. Most importantly, we justify in theory the monotonic improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, therefore establishing a new state of the art.
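The multi-agent advantage decomposition can be checked numerically on a small example. The sketch below is not taken from the paper's code: it assumes a single fixed state, a hypothetical random joint Q-table, independent per-agent policies, and the fixed agent order 1, 2, ..., n. It reads the lemma as saying that the joint advantage equals the sum of per-agent advantages, where each agent's advantage is conditioned on the actions of the agents before it. All names (`check_decomposition`, `q_prefix`) are illustrative.

```python
import itertools
import random


def check_decomposition(n_actions=(2, 3, 2), seed=0):
    """Return the largest gap between the joint advantage and the sum of
    sequential per-agent advantages over all joint actions (one state)."""
    rng = random.Random(seed)
    n = len(n_actions)
    joint_actions = list(itertools.product(*[range(k) for k in n_actions]))

    # Hypothetical joint Q-values at one fixed state.
    q = {a: rng.uniform(-1.0, 1.0) for a in joint_actions}

    # Independent per-agent policies built from normalised random weights.
    policies = []
    for k in n_actions:
        weights = [rng.random() + 0.1 for _ in range(k)]
        total = sum(weights)
        policies.append([w / total for w in weights])

    def q_prefix(prefix):
        """Q-value of the first len(prefix) agents' actions, averaging the
        remaining agents' actions under their own policies."""
        m = len(prefix)
        value = 0.0
        for rest in itertools.product(*[range(k) for k in n_actions[m:]]):
            prob = 1.0
            for offset, action in enumerate(rest):
                prob *= policies[m + offset][action]
            value += prob * q[prefix + rest]
        return value

    state_value = q_prefix(())
    max_gap = 0.0
    for a in joint_actions:
        joint_advantage = q[a] - state_value
        # Advantage of agent j+1 given the actions of agents 1..j.
        summed = sum(q_prefix(a[:j + 1]) - q_prefix(a[:j]) for j in range(n))
        max_gap = max(max_gap, abs(joint_advantage - summed))
    return max_gap
```

The sum telescopes, so the gap is zero up to floating-point error for any Q-table and any policies. No assumption on the structure of the joint value function is used, which matches the abstract's claim that decomposability is not required.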

Paper Structure

This paper contains 31 sections, 22 theorems, 70 equations, 6 figures, 7 tables, 3 algorithms.

Key Result

theorem 1

(TRPO) Let $\pi$ be the current policy and $\bar{\pi}$ be the next candidate policy. We define $L_{\pi}(\bar{\pi}) = J(\pi) + \mathbb{E}_{s\sim\rho_{\pi}, a\sim\bar{\pi}}\left[ A_{\pi}(s, a) \right]$ and $D_{\text{KL}}^{\text{max}}(\pi, \bar{\pi}) = \max_{s} D_{\text{KL}}\left(\pi(\cdot|s), \bar{\pi}(\cdot|s)\right)$. Then the inequality $J(\bar{\pi}) \geq L_{\pi}(\bar{\pi}) - C\, D_{\text{KL}}^{\text{max}}(\pi, \bar{\pi})$ holds, where $C = \frac{4\gamma\max_{s, a}|A_{\pi}(s, a)|}{(1-\gamma)^{2}}$.
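To get a feel for the size of the penalty coefficient $C$, the following sketch evaluates it and the resulting lower bound in the standard TRPO form $J(\bar{\pi}) \geq L_{\pi}(\bar{\pi}) - C\, D_{\text{KL}}^{\text{max}}(\pi, \bar{\pi})$. The function names and the numeric inputs are illustrative assumptions, not values from the paper.

```python
def trpo_penalty_coefficient(gamma, max_abs_advantage):
    """C = 4 * gamma * max|A| / (1 - gamma)^2."""
    return 4.0 * gamma * max_abs_advantage / (1.0 - gamma) ** 2


def trpo_lower_bound(surrogate, gamma, max_abs_advantage, max_kl):
    """Lower bound on J(new policy): surrogate minus C times the max-KL."""
    penalty = trpo_penalty_coefficient(gamma, max_abs_advantage)
    return surrogate - penalty * max_kl


# Illustrative inputs (assumed, not from the paper).
c_small = trpo_penalty_coefficient(gamma=0.5, max_abs_advantage=2.0)
c_large = trpo_penalty_coefficient(gamma=0.99, max_abs_advantage=1.0)
bound = trpo_lower_bound(surrogate=1.0, gamma=0.5,
                         max_abs_advantage=2.0, max_kl=0.01)
```

With $\gamma = 0.99$ and $\max|A_{\pi}| = 1$ the coefficient is roughly 39,600, so the bound only certifies improvement when the new policy stays very close to the old one in KL divergence.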

Figures (6)

  • Figure 1: Example of a two-player differentiable game with $r(a^1, a^2)=a^1a^2$. We initialise two Gaussian policies with $\mu^1 = -0.25$, $\mu^2=0.25$. The purple intervals represent the KL-ball of $\delta=0.5$. Individual trust region updates (red) decrease the joint return, whereas our sequential update (blue) leads to improvement.
  • Figure 2: Performance comparisons between HATRPO/HAPPO and MAPPO on three SMAC tasks. Since all methods achieve a 100% win rate, we believe SMAC is not sufficiently difficult to discriminate between the capabilities of these algorithms, especially when the tasks do not require agents to use separate (non-shared) parameters.
  • Figure 3: Performance comparison on multiple Multi-Agent MuJoCo tasks. HAPPO and HATRPO consistently outperform their rivals, thus establishing a new state of the art for MARL. The performance gap widens as the number of agents increases.
  • Figure 4: Performance comparison between original HATRPO, and its modified versions: HATRPO with parameter sharing, and HATRPO without randomisation of the sequential update scheme.
  • Figure 5: Performance comparison between HATRPO and MAPPO/IPPO without parameter sharing. HATRPO significantly outperforms its counterparts.
  • ...and 1 more figures
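The effect described in the Figure 1 caption can be reproduced with a few lines of arithmetic. This is a simplified sketch, not the paper's experiment: it assumes both Gaussian policies have a fixed unit variance (the caption does not state the variance), that only the means are updated, and that each agent moves its mean to the boundary of its KL-ball in the direction of its own gradient. Under these assumptions the expected return is $J = \mu^1\mu^2$, and a KL-ball of radius $\delta$ allows a mean shift of at most $\sigma\sqrt{2\delta}$.

```python
import math


def expected_return(mu1, mu2):
    """E[a1 * a2] for independent Gaussian actions equals mu1 * mu2."""
    return mu1 * mu2


def max_mean_shift(delta, sigma=1.0):
    """Largest mean shift with KL <= delta for equal-variance Gaussians,
    since KL = (shift ** 2) / (2 * sigma ** 2)."""
    return sigma * math.sqrt(2.0 * delta)


def trust_region_step(mu_self, mu_other, delta, sigma=1.0):
    """Move to the KL-ball boundary along the sign of dJ/dmu_self = mu_other."""
    direction = (mu_other > 0) - (mu_other < 0)
    return mu_self + direction * max_mean_shift(delta, sigma)


def simultaneous_update(mu1, mu2, delta):
    """Both agents react to each other's OLD mean."""
    return (trust_region_step(mu1, mu2, delta),
            trust_region_step(mu2, mu1, delta))


def sequential_update(mu1, mu2, delta):
    """Agent 1 updates first; agent 2 reacts to agent 1's NEW mean."""
    new_mu1 = trust_region_step(mu1, mu2, delta)
    new_mu2 = trust_region_step(mu2, new_mu1, delta)
    return new_mu1, new_mu2


# Initial means and KL radius from the Figure 1 caption.
mu1, mu2, delta = -0.25, 0.25, 0.5
j_initial = expected_return(mu1, mu2)
j_simultaneous = expected_return(*simultaneous_update(mu1, mu2, delta))
j_sequential = expected_return(*sequential_update(mu1, mu2, delta))
```

Here the initial return is -0.0625. The simultaneous update moves the means to (0.75, -0.75) and lowers the return to -0.5625, whereas the sequential update reaches (0.75, 1.25) and raises it to 0.9375, in line with the caption's qualitative claim.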

Theorems & Definitions (41)

  • definition 1
  • theorem 1
  • proposition 1
  • lemma 1: Multi-Agent Advantage Decomposition
  • definition 2
  • lemma 2
  • theorem 2
  • definition 3
  • theorem 3
  • proposition 2
  • ...and 31 more