Efficient Multi-agent Reinforcement Learning by Planning

Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, Chongjie Zhang

TL;DR

This work tackles sample inefficiency in cooperative multi-agent reinforcement learning by extending MuZero with a centralized-value, decentralized-dynamics model and planning via MCTS. It introduces OS($\lambda$), an optimistic planning mechanism, and AWPO, a policy objective that leverages optimistic value information to improve action selection in large, multi-agent action spaces. The MAZero architecture incorporates agent-specific representations, communication through attention, and shared parameters to capture coordination while enabling distributed execution. Empirical results on SMAC show superior data efficiency and strong performance relative to both model-free baselines and existing model-based methods, highlighting the practical impact of planning-informed MARL with CTDE and the proposed planning enhancements.

Abstract

Multi-agent reinforcement learning (MARL) algorithms have accomplished remarkable breakthroughs in solving large-scale decision-making tasks. Nonetheless, most existing MARL algorithms are model-free, limiting sample efficiency and hindering their applicability in more challenging scenarios. In contrast, model-based reinforcement learning (MBRL), particularly algorithms integrating planning, such as MuZero, has demonstrated superhuman performance with limited data in many tasks. Hence, we aim to boost the sample efficiency of MARL by adopting model-based approaches. However, incorporating planning and search methods into multi-agent systems poses significant challenges. The expansive action space of multi-agent systems often necessitates leveraging the nearly-independent property of agents to accelerate learning. To tackle this issue, we propose the MAZero algorithm, which combines a centralized model with Monte Carlo Tree Search (MCTS) for policy search. We design a novel network structure to facilitate distributed execution and parameter sharing. To enhance search efficiency in deterministic environments with sizable action spaces, we introduce two novel techniques: Optimistic Search Lambda (OS($\lambda$)) and Advantage-Weighted Policy Optimization (AWPO). Extensive experiments on the SMAC benchmark demonstrate that MAZero outperforms model-free approaches in terms of sample efficiency and provides comparable or better performance than existing model-based methods in terms of both sample and computational efficiency. Our code is available at https://github.com/liuqh16/MAZero.

Paper Structure (34 sections, 19 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Bandit experiment. We compare the Behavior Cloning (BC) loss and our Advantage-Weighted Policy Optimization (AWPO) loss on a bandit with action space $|\mathcal{A}| = 100$ and sampling time $B=2$. AWPO converges much faster than BC because it makes effective use of values.
  • Figure 2: MAZero model structure. Given the current observations $o_t^i$ for each agent, the model separately maps them into local hidden states $s_{t,0}^i$ using a shared representation network $h$. Value prediction $v_{t,0}$ is computed based on the global hidden state $\pmb{s}_{t,0}$, while policy priors $p_{t,0}^i$ are individually calculated for each agent using their corresponding local hidden states. Agents use the communication network $e$ to access team information $e_{t,0}^i$ and generate next local hidden states $s_{t,1}^i$ via the shared individual dynamic network $g$, subsequently deriving reward $r_{t,1}$, value $v_{t,1}$ and policies $p_{t,1}^i$. During the training stage, real future observations $\pmb{o}_{t+1}$ can be obtained to generate the target for the next hidden state, denoted as $\pmb{s}_{t+1,0}$.
  • Figure 3: Comparisons against baselines in SMAC. The y-axis denotes the win rate, and the x-axis denotes the number of steps taken in the environment. Each algorithm is executed with 10 random seeds.
  • Figure 4: Comparisons against MBRL baselines in SMAC. The y-axis denotes the win rate, and the x-axis denotes the cumulative run time of the algorithms on the same platform.
  • Figure 5: Ablation study on proposed approaches for planning.
  • ...and 6 more figures
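
The data flow described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is only a shape-level illustration under stated assumptions, not the authors' implementation: the layer sizes, the random linear maps standing in for trained networks, the use of plain dot-product self-attention for the communication network $e$, and reading the reward off the next global hidden state are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_AGENTS, OBS_DIM, HID_DIM, N_ACTIONS = 3, 8, 16, 5

# Random weights stand in for trained networks (assumption).
W_h = rng.normal(scale=0.1, size=(OBS_DIM, HID_DIM))    # shared representation h
W_p = rng.normal(scale=0.1, size=(HID_DIM, N_ACTIONS))  # shared per-agent policy head
w_v = rng.normal(scale=0.1, size=N_AGENTS * HID_DIM)    # value head on global state
w_r = rng.normal(scale=0.1, size=N_AGENTS * HID_DIM)    # reward head on global state
W_g = rng.normal(scale=0.1, size=(2 * HID_DIM + N_ACTIONS, HID_DIM))  # shared dynamics g


def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def represent(obs):
    """Map each agent's observation o_t^i to a local hidden state s_{t,0}^i."""
    return np.tanh(obs @ W_h)  # (N_AGENTS, HID_DIM)


def predict(s):
    """Per-agent policy priors from local states; one value from the global state."""
    policies = softmax(s @ W_p)            # (N_AGENTS, N_ACTIONS)
    value = float(s.reshape(-1) @ w_v)     # global state = concatenated local states
    return policies, value


def communicate(s):
    """Team information e^i via dot-product self-attention (assumed form of e)."""
    scores = s @ s.T / np.sqrt(s.shape[-1])  # (N_AGENTS, N_AGENTS)
    return softmax(scores) @ s               # (N_AGENTS, HID_DIM)


def dynamics(s, actions):
    """Shared individual dynamics g: next local states plus a team reward."""
    a_onehot = np.eye(N_ACTIONS)[actions]    # (N_AGENTS, N_ACTIONS)
    e = communicate(s)
    s_next = np.tanh(np.concatenate([s, e, a_onehot], axis=-1) @ W_g)
    reward = float(s_next.reshape(-1) @ w_r)  # assumption: reward from next global state
    return s_next, reward


# One unrolled model step: observations -> s_{t,0} -> s_{t,1}.
obs = rng.normal(size=(N_AGENTS, OBS_DIM))
s0 = represent(obs)
p0, v0 = predict(s0)
actions = p0.argmax(axis=-1)   # greedy actions, just to drive the example
s1, r1 = dynamics(s0, actions)
p1, v1 = predict(s1)
```

Because `W_h`, `W_p`, and `W_g` are applied row-wise to every agent, the sketch mirrors the parameter sharing mentioned in the caption: only the value and reward heads see the concatenated global state, while each policy prior depends on its own local hidden state.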