Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework
Jianing Ye, Chenghao Li, Yongqiang Dou, Jianhao Wang, Guangwen Yang, Chongjie Zhang
TL;DR
The paper tackles the suboptimality of widely used cooperative MARL algorithms that combine decentralized policies with gradient descent optimization. It proves that multi-agent policy gradient (MA-PG) and value-decomposition (VD) methods can converge to suboptimal policies under gradient descent because of their constrained decentralized parameterizations, and it introduces Transformation And Distillation (TAD) to remedy this. TAD reframes a multi-agent MDP (MMDP) as a sequential single-agent MDP via the transformation $\Gamma$, learns an optimal coordination policy with a standard single-agent RL (SARL) method, and then distills decentralized policies through KL-based behavior cloning, preserving optimality (Theorem: TAD keeps optimality). The resulting TAD-PPO achieves strong empirical performance across simple matrix games, multi-step hallway tasks, StarCraft II SMAC, and Google Research Football, often surpassing state-of-the-art baselines while providing formal optimality guarantees (including an optimality proof for finite MMDPs). This work highlights the critical role of optimization structure in MARL and offers a principled framework to translate single-agent guarantees into decentralized coordination with practical scalability.
Abstract
Decentralized execution is a core requirement in multi-agent reinforcement learning (MARL). Most popular recent MARL algorithms adopt decentralized policies to enable decentralized execution and use gradient descent as the optimizer. However, there is hardly any theoretical analysis of these algorithms that takes the optimization method into consideration, and we find that various popular MARL algorithms with decentralized policies are suboptimal in toy tasks when gradient descent is chosen as their optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods -- and prove their suboptimality when gradient descent is used. To address the suboptimality issue, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the learned policy on the derived "single-agent" MDP. The approach is a two-stage learning paradigm that addresses the optimization problem in cooperative MARL, providing an optimality guarantee with decent execution performance. Empirically, we implement TAD-PPO based on PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms baselines on a large set of cooperative multi-agent tasks, ranging from matrix games and hallway tasks to StarCraft II and football games.
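The two-stage idea can be illustrated on a one-step matrix game. The following is a minimal sketch under stated assumptions, not the paper's TAD-PPO: the payoff matrix and the helper names (`solve_sequential`, `greedy_joint_action`, `distill`) are hypothetical, exact backward induction stands in for the single-agent learner, and because the resulting coordination policy is deterministic, distillation reduces to each agent keeping its own component of the joint action rather than KL-based behavior cloning.

```python
import itertools
import numpy as np

# Hypothetical 2-agent, 3-action cooperative payoff matrix (illustrative values).
# Entry [a0, a1] is the shared reward when agent 0 plays a0 and agent 1 plays a1.
PAYOFF = np.array([
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
])
N_AGENTS = PAYOFF.ndim
N_ACTIONS = PAYOFF.shape[0]


def solve_sequential(payoff):
    """Stage 1 (transformation): treat the joint decision as a sequential
    single-agent problem in which agents choose one after another and the
    shared reward arrives once the last agent has acted. Backward induction
    gives the optimal value of every partial joint action (prefix)."""
    values = {}
    for depth in range(N_AGENTS, -1, -1):
        for prefix in itertools.product(range(N_ACTIONS), repeat=depth):
            if depth == N_AGENTS:
                values[prefix] = float(payoff[prefix])
            else:
                values[prefix] = max(
                    values[prefix + (a,)] for a in range(N_ACTIONS)
                )
    return values


def greedy_joint_action(values):
    """Roll out the greedy sequential policy to obtain a full joint action."""
    prefix = ()
    while len(prefix) < N_AGENTS:
        best = max(range(N_ACTIONS), key=lambda a: values[prefix + (a,)])
        prefix += (best,)
    return prefix


def distill(joint_action):
    """Stage 2 (distillation, simplified): each agent keeps only its own
    component of the deterministic joint action as a one-hot policy, so no
    agent needs to observe the others' actions at execution time."""
    return [np.eye(N_ACTIONS)[a] for a in joint_action]


values = solve_sequential(PAYOFF)
joint = greedy_joint_action(values)
policies = distill(joint)

print("optimal value:", values[()])
print("joint action:", joint)
for i, pi in enumerate(policies):
    print(f"agent {i} policy:", pi)
```

In this toy setting the sequential view recovers the joint action with the highest payoff directly, and the distilled one-hot policies reproduce it without any communication between agents.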
