Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework

Jianing Ye, Chenghao Li, Yongqiang Dou, Jianhao Wang, Guangwen Yang, Chongjie Zhang

TL;DR

The paper tackles the suboptimality of widely used cooperative MARL algorithms that rely on decentralized execution and gradient descent optimization. It proves that multi-agent policy gradient (MA-PG) and value-decomposition (VD) methods can converge to suboptimal policies under gradient descent due to constrained decentralized parameterizations, and it introduces Transformation And Distillation (TAD) to remedy this. TAD reframes a multi-agent MDP (MMDP) as a sequential single-agent MDP via the transformation $\Gamma$, learns an optimal coordination policy with a standard single-agent RL (SARL) method, and then distills decentralized policies through KL-based behavior cloning, preserving optimality (Theorem: TAD keeps optimality). The resulting TAD-PPO achieves strong empirical performance across simple matrix games, multi-step hallway tasks, StarCraft II SMAC, and Google Research Football, often surpassing state-of-the-art baselines while providing formal optimality guarantees (including an optimality proof for finite MMDPs). This work highlights the critical role of optimization structure in MARL and offers a principled framework to translate single-agent guarantees into decentralized coordination with practical scalability.
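To make the sequential reformulation concrete, here is a minimal Python sketch of one plausible reading of the transformation: agents choose their actions one at a time, the partial joint action is carried in the state, and the team reward is paid only after the last agent has acted. The class `SequentialView`, the helper `rollout`, and the 2x2 payoff table are hypothetical names and values chosen for illustration; they are not taken from the paper, and the paper's $\Gamma$ may differ in detail.

```python
from typing import Callable, List, Tuple


class SequentialView:
    """Hypothetical sequential single-agent view of a one-step cooperative game.

    Agents pick actions one at a time. The actions chosen so far form the
    state, and the team reward is paid only after the last agent has acted.
    """

    def __init__(self, n_agents: int,
                 reward_fn: Callable[[Tuple[int, ...]], float]):
        self.n_agents = n_agents
        self.reward_fn = reward_fn
        self.chosen: List[int] = []

    def reset(self) -> Tuple[int, ...]:
        self.chosen = []
        return tuple(self.chosen)

    def step(self, action: int):
        # One micro-step: the next agent in the fixed order commits its action.
        self.chosen.append(action)
        done = len(self.chosen) == self.n_agents
        reward = self.reward_fn(tuple(self.chosen)) if done else 0.0
        return tuple(self.chosen), reward, done


# Illustrative 2x2 team payoff (an assumption, not a task from the paper).
PAYOFF = [[2.0, 0.0], [0.0, 1.0]]


def rollout(actions):
    """Play one episode with the given per-agent actions; return total reward."""
    env = SequentialView(2, lambda joint: PAYOFF[joint[0]][joint[1]])
    env.reset()
    total = 0.0
    for a in actions:
        _, r, _ = env.step(a)
        total += r
    return total
```

In this view a single learner faces an ordinary MDP whose action space at each micro-step is one agent's action set, which is what lets a standard single-agent method be applied unchanged.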

Abstract

Decentralized execution is a core requirement in multi-agent reinforcement learning (MARL). Most popular recent MARL algorithms adopt decentralized policies to enable decentralized execution and use gradient descent as the optimizer. However, there is hardly any theoretical analysis of these algorithms that takes the optimization method into consideration, and we find that various popular MARL algorithms with decentralized policies are suboptimal in toy tasks when gradient descent is chosen as their optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies, multi-agent policy gradient methods and value-decomposition methods, and prove their suboptimality when gradient descent is used. To address the suboptimality issue, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the learned policy on the derived "single-agent" MDP. The approach is a two-stage learning paradigm that addresses the optimization problem in cooperative MARL, providing an optimality guarantee with decent execution performance. Empirically, we implement TAD-PPO based on PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and shows significantly better performance on a large set of cooperative multi-agent tasks, ranging from matrix games and hallway tasks to StarCraft II and football games.
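The distillation stage can be illustrated with a small Python sketch. It relies on a standard fact: among all factorized (decentralized) distributions, the one that minimizes the forward KL divergence from a joint "teacher" distribution is the product of the teacher's marginals. The function names `marginals` and `forward_kl` and the example teachers are hypothetical; this summary does not state the exact form of the paper's distillation loss, so treat this as an assumption-laden illustration rather than the authors' procedure.

```python
import math


def marginals(joint, n_actions):
    """Per-agent marginals of a joint action distribution.

    joint: dict mapping joint-action tuples to probabilities.
    n_actions: list with the number of actions of each agent.
    """
    margs = [[0.0] * k for k in n_actions]
    for joint_action, p in joint.items():
        for agent, action in enumerate(joint_action):
            margs[agent][action] += p
    return margs


def forward_kl(joint, margs):
    """KL(joint || product of per-agent marginals)."""
    kl = 0.0
    for joint_action, p in joint.items():
        if p > 0.0:
            q = math.prod(margs[i][a] for i, a in enumerate(joint_action))
            kl += p * math.log(p / q)
    return kl


# A deterministic teacher is reproduced exactly by decentralized students.
deterministic = {(0, 0): 1.0}
kl_det = forward_kl(deterministic, marginals(deterministic, [2, 2]))

# A correlated stochastic teacher cannot be factorized without loss.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
kl_cor = forward_kl(correlated, marginals(correlated, [2, 2]))
```

Here `kl_det` is zero while `kl_cor` equals $\log 2$, which suggests why distilling a deterministic coordination policy into decentralized policies need not lose anything, whereas a correlated stochastic one would.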

Paper Structure

This paper contains 52 sections, 11 theorems, 10 equations, 14 figures, 14 tables, 3 algorithms.

Key Result

Theorem 4.1

There are tasks such that, when the parameter $\Theta$ is initialized in a certain region $\Omega$ with positive volume, MA-PG (with the factorized policy parameterization and the MA-PG loss defined in the paper) converges to a suboptimal policy for a small enough learning rate $\alpha$.
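The flavor of this result can be seen in a toy Python experiment. It is a hypothetical illustration, not the construction used in the paper's proof: the 2x2 payoff, the sigmoid parameterization, the initialization, and the names `expected_return` and `run_gradient_ascent` are all assumptions made for this sketch. Two agents with independent policies receive payoff 2 if both pick action 0, payoff 1 if both pick action 1, and 0 otherwise, and exact gradient ascent is run on the expected return.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_return(p, q):
    # p, q: probabilities that agent 1 and agent 2 pick action 0.
    # Payoff 2 if both pick action 0, 1 if both pick action 1, else 0.
    return 2.0 * p * q + (1.0 - p) * (1.0 - q)


def run_gradient_ascent(theta1, theta2, lr=0.5, steps=2000):
    """Exact gradient ascent on the expected return of factorized policies."""
    for _ in range(steps):
        p, q = sigmoid(theta1), sigmoid(theta2)
        # dJ/dp = 3q - 1 and dJ/dq = 3p - 1; chain rule through the sigmoid.
        g1 = (3.0 * q - 1.0) * p * (1.0 - p)
        g2 = (3.0 * p - 1.0) * q * (1.0 - q)
        theta1 += lr * g1
        theta2 += lr * g2
    return sigmoid(theta1), sigmoid(theta2)


# Both agents start with probability below 1/3 on action 0.
p, q = run_gradient_ascent(-1.4, -1.4)
final_return = expected_return(p, q)
```

Whenever both initial probabilities are below 1/3, both gradients are negative and remain so, so the policies drift toward the joint action with payoff 1 and the return approaches 1 rather than the optimum 2. The set of such initializations has positive volume in parameter space, which mirrors the shape of the theorem's claim.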

Figures (14)

  • Figure 1: The architecture of the transformation and distillation framework with PPO (TAD-PPO). WK, WQ, and V are the key, query, and value matrices, respectively, in the multi-head attention (MHA) module (Vaswani et al., 2017).
  • Figure 2: Learning curves in the didactic example.
  • Figure 3: Payoff of one matrix game with locally optimal points, and the joint policies learned by our approach TAD-PPO and by MAPPO.
  • Figure 4: Performance comparison on SMAC benchmark.
  • Figure 5: Performance comparison on GRF benchmark.
  • ...and 9 more figures

Theorems & Definitions (14)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 5.1
  • Theorem 5.2
  • Definition B.1: $L$-smoothness
  • Lemma B.2
  • Lemma B.3
  • Lemma B.4
  • Lemma B.5
  • Definition B.11: Completeness of $Q$-function Class
  • ...and 4 more