MALinZero: Efficient Low-Dimensional Search for Mastering Complex Multi-Agent Planning
Sizhe Tang, Jiayu Chen, Tian Lan
TL;DR
The paper tackles the challenge of exponential joint-action spaces in cooperative multi-agent planning with Monte Carlo Tree Search. It introduces MALinZero, which casts the joint-action learning problem as a contextual linear bandit over a low-dimensional space of per-agent rewards and derives LinUCT to guide exploration and exploitation, supported by a regret bound $\hat{R}_T = O(nd \sqrt{μ T} \ln(T))$ and a $(1-\tfrac{1}{e})$-approximation for joint-action selection via submodular maximization. By reducing the effective action space from $d^n$ to $nd$, MALinZero achieves state-of-the-art performance on MatGame, SMAC, and SMACv2 with faster learning and robust performance in large-scale settings. The method combines a six-network architectural framework, dynamic node generation, and a theoretical backbone for exploration, providing a practical and scalable solution for multi-agent MCTS. Overall, MALinZero advances efficient planning in multi-agent systems by coupling low-dimensional representations with principled exploration guarantees and strong empirical results.
Abstract
Monte Carlo Tree Search (MCTS), which leverages the Upper Confidence Bound for Trees (UCT) to balance exploration and exploitation through randomized sampling, is instrumental in solving complex planning problems. However, for multi-agent planning, MCTS is confronted with a large combinatorial action space that often grows exponentially with the number of agents. As a result, the branching factor of MCTS during tree expansion also increases exponentially, making it very difficult to explore and exploit efficiently during tree search. To this end, we propose MALinZero, a new approach that leverages low-dimensional representational structures on joint-action returns to enable efficient MCTS in complex multi-agent planning. Our solution can be viewed as projecting the joint-action returns into a low-dimensional space that can be represented by a contextual linear bandit formulation. We solve the contextual linear bandit problem with convex and $μ$-smooth loss functions, which place more importance on better joint actions and mitigate potential representational limitations, and derive a linear Upper Confidence Bound applied to trees (LinUCT) to enable novel multi-agent exploration and exploitation in the low-dimensional space. We analyze the regret of MALinZero for low-dimensional reward functions and propose a $(1-\tfrac1e)$-approximation algorithm for joint-action selection by maximizing a submodular objective. MALinZero demonstrates state-of-the-art performance on multi-agent benchmarks such as matrix games, SMAC, and SMACv2, outperforming both model-based and model-free multi-agent reinforcement learning baselines with faster learning and better performance.
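To make the dimensionality reduction concrete, the following is a minimal sketch of a linear bandit over joint actions. It is not the paper's LinUCT: it uses the standard ridge-regression LinUCB rule with a squared loss, whereas the abstract describes convex $μ$-smooth losses inside a tree search. The one-hot encoding of a joint action into $nd$ features, the class name `LinUCBJointAction`, and the parameters `reg` and `beta` are all assumptions made for illustration.

```python
import itertools
import numpy as np


def encode(joint_action, n_agents, n_actions):
    """Assumed encoding: concatenate one one-hot block per agent (length n*d)."""
    x = np.zeros(n_agents * n_actions)
    for agent, action in enumerate(joint_action):
        x[agent * n_actions + action] = 1.0
    return x


class LinUCBJointAction:
    """Generic LinUCB over n*d features (hypothetical stand-in, not LinUCT)."""

    def __init__(self, n_agents, n_actions, reg=1.0, beta=1.0):
        self.n, self.d = n_agents, n_actions
        dim = n_agents * n_actions
        self.A = reg * np.eye(dim)  # regularized Gram matrix of seen features
        self.b = np.zeros(dim)      # sum of return-weighted features
        self.beta = beta            # exploration weight (illustrative knob)

    def score(self, joint_action):
        x = encode(joint_action, self.n, self.d)
        theta = np.linalg.solve(self.A, self.b)          # ridge estimate
        bonus = np.sqrt(x @ np.linalg.solve(self.A, x))  # confidence width
        return float(x @ theta + self.beta * bonus)

    def select(self):
        # Brute-force argmax over all d**n joint actions; only viable for
        # tiny problems and used here purely to keep the sketch short.
        candidates = itertools.product(range(self.d), repeat=self.n)
        return max(candidates, key=self.score)

    def update(self, joint_action, ret):
        x = encode(joint_action, self.n, self.d)
        self.A += np.outer(x, x)
        self.b += ret * x


if __name__ == "__main__":
    n_agents, n_actions = 3, 4
    # 64 joint actions are described by only 12 learned parameters.
    print(n_actions ** n_agents, "joint actions,", n_agents * n_actions, "parameters")
    bandit = LinUCBJointAction(n_agents, n_actions)
    action = bandit.select()
    bandit.update(action, ret=1.0)
```

The learned parameter vector has $nd$ entries rather than one value per joint action, which is the reduction from $d^n$ to $nd$ that the summary refers to. The brute-force `select` still enumerates every joint action, so it does not scale; the abstract addresses that step with a $(1-\tfrac1e)$-approximation based on submodular maximization, which this sketch does not attempt to reproduce.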
