MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning

Sizhe Tang, Jiayu Chen, Tian Lan

TL;DR

The paper tackles the challenge of exponential joint-action spaces in cooperative multi-agent planning with Monte Carlo Tree Search. It introduces MALinZero, which casts the joint-action learning problem as a contextual linear bandit over a low-dimensional space of per-agent rewards and derives LinUCT to guide exploration and exploitation, supported by a regret bound $\hat{R}_T = O(nd \sqrt{\mu T} \ln(T))$ and a $(1-\tfrac{1}{e})$-approximation for joint-action selection via submodular maximization. By reducing the effective action space from $d^n$ to $nd$, MALinZero achieves state-of-the-art performance on MatGame, SMAC, and SMACv2 with faster learning and robust performance in large-scale settings. The method combines a six-network architectural framework, dynamic node generation, and a theoretical backbone for exploration, providing a practical and scalable solution for multi-agent MCTS. Overall, MALinZero advances efficient planning in multi-agent systems by coupling low-dimensional representations with principled exploration guarantees and strong empirical results.
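To make the $d^n$ versus $nd$ reduction concrete, here is a minimal sketch under stated assumptions, not the paper's implementation. It assumes each of $n$ agents has $d$ actions, that a joint action is encoded as $n$ concatenated one-hot blocks, and that the return is approximately linear in that encoding. The names `encode_joint_action` and `linucb_score`, the ridge parameter `lam`, and the bonus scale `beta` are hypothetical. The scoring rule is a generic LinUCB-style upper confidence bound, used here as a stand-in for LinUCT.

```python
import numpy as np

def encode_joint_action(joint_action, d):
    """Concatenate one one-hot block of length d per agent (n*d features total)."""
    n = len(joint_action)
    x = np.zeros(n * d)
    for agent, action in enumerate(joint_action):
        x[agent * d + action] = 1.0
    return x

def linucb_score(x, theta_hat, V_inv, beta):
    """Estimated return plus an exploration bonus (generic LinUCB form)."""
    return float(x @ theta_hat + beta * np.sqrt(x @ V_inv @ x))

n, d = 3, 4
num_joint_actions = d ** n  # size of the joint-action space
num_features = n * d        # size of the low-dimensional representation

# Ridge-regression statistics; lam is an assumed regularization weight.
lam = 1.0
V = lam * np.eye(num_features)
b = np.zeros(num_features)

# Synthetic data: returns generated by an assumed linear model plus noise.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=num_features)
for _ in range(200):
    a = rng.integers(0, d, size=n)
    x = encode_joint_action(a, d)
    r = x @ true_theta + 0.01 * rng.normal()
    V += np.outer(x, x)
    b += r * x

theta_hat = np.linalg.solve(V, b)
V_inv = np.linalg.inv(V)
score = linucb_score(encode_joint_action((0, 1, 2), d), theta_hat, V_inv, beta=1.0)
```

With these toy values the planner estimates 12 parameters rather than one value per each of the 64 joint actions, and the gap widens exponentially as $n$ grows.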

Abstract

Monte Carlo Tree Search (MCTS), which leverages the Upper Confidence Bound for Trees (UCT) to balance exploration and exploitation through randomized sampling, is instrumental to solving complex planning problems. However, for multi-agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to efficiently explore and exploit during tree search. To this end, we propose MALinZero, a new approach to leverage low-dimensional representational structures on joint-action returns and enable efficient MCTS in complex multi-agent planning. Our solution can be viewed as projecting the joint-action returns into a low-dimensional space representable using a contextual linear bandit problem formulation. We solve the contextual linear bandit problem with convex and $\mu$-smooth loss functions -- in order to place more importance on better joint actions and mitigate potential representational limitations -- and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi-agent exploration and exploitation in the low-dimensional space. We analyze the regret of MALinZero for low-dimensional reward functions and propose a $(1-\tfrac{1}{e})$-approximation algorithm for joint-action selection by maximizing a submodular objective. MALinZero demonstrates state-of-the-art performance on multi-agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model-based and model-free multi-agent reinforcement learning baselines with faster learning speed and better performance.
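For readers unfamiliar with where a $(1-\tfrac{1}{e})$ factor comes from, the sketch below shows the classic greedy algorithm for maximizing a monotone submodular function under a cardinality constraint, which is known to achieve that ratio. It is a generic illustration only. The paper's actual objective and constraint over joint actions are not given in this summary, so the coverage function, the helper name `greedy_max_coverage`, and the toy sets are all assumptions.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets greedily, each time taking the largest marginal coverage gain."""
    chosen, covered = [], set()
    for _ in range(min(k, len(sets))):
        candidates = [i for i in range(len(sets)) if i not in chosen]
        # Marginal gain of a set = number of new elements it would cover.
        best = max(candidates, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

# Toy instance (hypothetical data): coverage is monotone and submodular.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, k=2)
```

On this toy instance the greedy choice picks the set at index 2 first and then index 0, covering all seven elements. In general the greedy value is only guaranteed to be within a factor $(1-\tfrac{1}{e})$ of the optimum.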

Paper Structure

This paper contains 33 sections, 13 theorems, 89 equations, 3 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

[Regret Bound of LinUCT] With probability $1-\delta$, the regret of LinUCT satisfies

Figures (3)

  • Figure 1: Evaluations on 3 SMAC tasks/maps. The Y-axis denotes the win rate and the X-axis denotes training steps. Each algorithm is executed with 3 random seeds. MALinZero achieves a win rate above 95% on all 3 maps, outperforming all baselines, and reaches a high win rate much faster.
  • Figure 2: Comparisons on 3 SMACv2 tasks/maps. The Y-axis denotes the win rate and the X-axis denotes training steps. MALinZero nearly doubles the win rate on these challenging SMACv2 maps and consistently outperforms all baselines. Each algorithm is executed with 3 random seeds.
  • Figure 3: Ablation study of MALinZero, removing individual design components such as dynamic node generation (DNG) and the general convex loss $f$ in the contextual bandit problem.

Theorems & Definitions (21)

  • Theorem 1
  • Corollary 2: The Order of Regret Bound for LinUCT
  • Theorem 3
  • Theorem 4
  • Theorem 5: Complexity of the Back-Propagation to update $\hat{\theta}_t$ and $V_t^{-1}A$
  • Lemma 1: Confidence Ellipsoid
  • proof
  • Lemma 2: Self‑Normalized Martingale Tail
  • proof
  • Lemma 3: Elliptical Potential
  • ...and 11 more