PMAT: Optimizing Action Generation Order in Multi-Agent Reinforcement Learning

Kun Hu, Muning Wen, Xihuai Wang, Shao Zhang, Yiwei Shi, Minne Li, Minglong Li, Ying Wen

TL;DR

PMAT addresses inter-agent dependencies in multi-agent reinforcement learning by learning an optimal action-generation order through AGPS, a Plackett-Luce sampling-based mechanism. It integrates AGPS with the Multi-Agent Transformer to create a sequential, order-aware MARL algorithm that assigns decision credits based on local observations. The approach delivers consistent performance gains across StarCraft II SMAC, Google Research Football, and Multi-Agent MuJoCo benchmarks, demonstrating improved coordination, stability, and sample efficiency over state-of-the-art baselines. This work highlights the importance of dynamic, credit-based ordering for cooperative policies and provides a scalable framework for optimizing decision order in complex multi-agent tasks.

Abstract

Multi-agent reinforcement learning (MARL) faces challenges in coordinating agents due to complex interdependencies within multi-agent systems. Most MARL algorithms use the simultaneous decision-making paradigm but ignore the action-level dependencies among agents, which reduces coordination efficiency. In contrast, the sequential decision-making paradigm provides finer-grained supervision for agent decision order, presenting the potential for handling dependencies via better decision order management. However, determining the optimal decision order remains a challenge. In this paper, we introduce Action Generation with Plackett-Luce Sampling (AGPS), a novel mechanism for agent decision order optimization. We model the order determination task as a Plackett-Luce sampling process to address issues such as ranking instability and vanishing gradient during the network training process. AGPS realizes credit-based decision order determination by establishing a bridge between the significance of agents' local observations and their decision credits, thus facilitating order optimization and dependency management. Integrating AGPS with the Multi-Agent Transformer, we propose the Prioritized Multi-Agent Transformer (PMAT), a sequential decision-making MARL algorithm with decision order optimization. Experiments on benchmarks including StarCraft II Multi-Agent Challenge, Google Research Football, and Multi-Agent MuJoCo show that PMAT outperforms state-of-the-art algorithms, greatly enhancing coordination efficiency.
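
To make the Plackett-Luce sampling step concrete, the following is a minimal sketch, assuming each agent has already been assigned a scalar score (its decision credit). The function name `sample_pl_order` and the use of NumPy are illustrative choices, not taken from the paper. The sketch draws an order by repeatedly picking one of the remaining agents with probability proportional to the exponential of its score, and also returns the log-probability of the sampled order, which a policy-gradient style update could use.

```python
import numpy as np

def sample_pl_order(scores, rng):
    """Sample a permutation from a Plackett-Luce model.

    scores: per-agent log-weights (hypothetical "decision credits").
    rng:    a numpy.random.Generator.
    Returns (order, log_prob), where order[k] is the agent acting k-th.
    """
    scores = np.asarray(scores, dtype=float)
    remaining = list(range(len(scores)))
    order, log_prob = [], 0.0
    while remaining:
        # Softmax over the agents that have not been placed yet.
        logits = scores[remaining]
        logits = logits - logits.max()  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        pick = int(rng.choice(len(remaining), p=probs))
        log_prob += float(np.log(probs[pick]))
        order.append(remaining.pop(pick))
    return order, log_prob

rng = np.random.default_rng(0)
order, log_prob = sample_pl_order([1.5, 0.2, -0.7, 0.9], rng)
print(order, log_prob)
```

Agents with higher scores tend to act earlier, but every permutation keeps non-zero probability, so the order stays stochastic during training.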

Paper Structure

This paper contains 24 sections, 1 theorem, 19 equations, 7 figures, 9 tables.

Key Result

Theorem 1

Let $i_{1:m}$ be a permutation of agents and let $i_k$ denote the $k$-th agent within $i_{1:m}$. Then, for any joint observation $\bm{o} \in \bm{\mathcal{O}}$ and joint action $\bm{a}^{i_{1:m}} \in \bm{\mathcal{A}}$, the following equation always holds:
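
The displayed equation is not reproduced in this summary. For orientation, the multi-agent advantage decomposition in the cited source (wen2022multi) is usually written in the form below. The summation index $k$ and the symbol $A_{\pi}$ for the multi-agent advantage function of Definition 1 are chosen here to match the notation of the statement, and the exact typesetting in the paper may differ.

$$
A_{\pi}^{i_{1:m}}\left(\bm{o}, \bm{a}^{i_{1:m}}\right) = \sum_{k=1}^{m} A_{\pi}^{i_k}\left(\bm{o}, \bm{a}^{i_{1:k-1}}, a^{i_k}\right)
$$

Read term by term, the joint advantage splits into a sum of per-agent advantages, each conditioned on the actions already chosen by the agents earlier in the order.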

Figures (7)

  • Figure 1: The simultaneous action generation paradigm generates agents' actions concurrently and interacts with the environment once. The sequential action generation paradigm generates agents' actions in an agent-by-agent manner, providing finer-grained supervision for the action generation order. Agents can interact with the environment once per decision or once per iteration under this paradigm.
  • Figure 2: A multi-agent cooperation scenario taken from Google Research Football. Player JOHNSON passes the ball to his partner TURING who has a favorable shooting angle (left), and TURING converts the shot into a goal (right).
  • Figure 3: The overall framework of the proposed Prioritized Multi-Agent Transformer. The encoder processes agents' local observations at each time step, transforming them into high-level representations. The observation representations are subsequently fed into the scoring block to generate individual preference scores, referred to as decision credits. P-L sampling is then conducted based on the scoring to compute the action generation order. The representations are reordered prior to being fed into the decoder which sequentially generates agents' actions in accordance with this reordered sequence.
  • Figure 4: Experimental results in StarCraft II, Google Research Football, and Multi-Agent MuJoCo.
  • Figure 5: Ablation results in StarCraft II, Google Research Football, and Multi-Agent MuJoCo.
  • ...and 2 more figures
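
The pipeline described for Figure 3 (score, sample an order, reorder, decode sequentially) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `obs_repr` stands in for the encoder output, `policy` is a hypothetical callable replacing the decoder, and the order is sampled with the Gumbel trick, a standard way to draw a Plackett-Luce sample that the paper may or may not use.

```python
import numpy as np

def pl_order_gumbel(scores, rng):
    # Gumbel trick: sorting (score + Gumbel noise) in descending order
    # yields a Plackett-Luce sample with log-weights `scores`.
    noise = rng.gumbel(size=len(scores))
    return np.argsort(-(np.asarray(scores, dtype=float) + noise))

def act_in_order(obs_repr, scores, policy, rng):
    """Generate actions agent by agent in a sampled order.

    obs_repr: (n_agents, d) array of per-agent representations
              (stand-in for the encoder output).
    scores:   per-agent scalar scores (stand-in for the scoring block).
    policy:   hypothetical callable (own_repr, previous_actions) -> action.
    Returns (order, actions), with actions indexed by original agent id.
    """
    order = pl_order_gumbel(scores, rng)
    actions = [None] * len(order)
    previous = []
    for agent in order:
        # Each agent sees the actions of agents earlier in the order.
        action = policy(obs_repr[agent], list(previous))
        previous.append(action)
        actions[int(agent)] = action  # store under the original agent index
    return order, actions

# Toy usage: the "policy" just reports how many agents acted before it.
rng = np.random.default_rng(0)
obs_repr = np.zeros((4, 3))
order, actions = act_in_order(obs_repr, [0.1, 2.0, -1.0, 0.5],
                              lambda rep, prev: len(prev), rng)
print(order, actions)
```

Writing each action back under the original agent index keeps the environment interface unchanged regardless of the sampled order.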

Theorems & Definitions (3)

  • Definition 1: Multi-Agent Advantage Function (kuba2022trust)
  • Theorem 1: Multi-Agent Advantage Decomposition (wen2022multi)
  • Definition 2: Optimal Action Generation Order