Is Centralized Training with Decentralized Execution Framework Centralized Enough for MARL?

Yihe Zhou, Shunyu Liu, Yunpeng Qing, Kaixuan Chen, Tongya Zheng, Jie Song, Mingli Song

TL;DR

This work critiques Centralized Training with Decentralized Execution (CTDE) for not leveraging full global information during training, a consequence of its independence assumption among agent policies. It introduces Centralized Advising and Decentralized Pruning (CADP), which enables explicit advice exchange among agents during training via a self-attention mechanism and gradually prunes cross-agent dependencies to guarantee decentralized execution. Empirical results on StarCraft II SMAC and Google Research Football show that CADP outperforms traditional CTDE and teacher–student CTDE baselines across multiple backbones, with pruning yielding decentralized policies that retain strong cooperation. Overall, CADP provides a general, scalable training framework that enhances joint-policy learning while preserving fully decentralized execution, with potentially broad applicability to value-decomposition (VD) and policy-gradient (PG) MARL methods.

Abstract

Centralized Training with Decentralized Execution (CTDE) has recently emerged as a popular framework for cooperative Multi-Agent Reinforcement Learning (MARL), where agents can use additional global state information to guide training in a centralized way and make their own decisions based only on decentralized local policies. Despite the encouraging results achieved, CTDE makes an independence assumption on agent policies, which prevents agents from adopting global cooperative information from each other during centralized training. Therefore, we argue that existing CTDE methods cannot fully utilize global information for training, leading to inefficient joint-policy exploration and even suboptimal results. In this paper, we introduce a novel Centralized Advising and Decentralized Pruning (CADP) framework for multi-agent reinforcement learning that not only enables effective message exchange among agents during training but also guarantees independent policies for execution. First, CADP endows agents with an explicit communication channel to seek and take advice from other agents for more centralized training. To further ensure decentralized execution, we propose a smooth model pruning mechanism that progressively constrains agent communication into a closed one without degrading agent cooperation capability. Empirical evaluations on StarCraft II micromanagement and Google Research Football benchmarks demonstrate that the proposed framework achieves superior performance compared with state-of-the-art counterparts. Our code will be made publicly available.

Paper Structure

This paper contains 28 sections, 9 equations, 14 figures, 5 tables, 2 algorithms.

Figures (14)

  • Figure 1: Comparisons between existing frameworks and our CADP. (a) Basic CTDE framework. Each agent learns its individual policy by optimizing the joint value of the centralized module with the global state. (b) Teacher-student CTDE framework. This framework introduces knowledge distillation to improve agent learning, where teachers use global information and students use local information. (c) Our CADP framework. Agents exchange their advice during centralized training, then prune the dependence (still with reinforcement learning) for decentralized execution.
  • Figure 2: Illustrative diagram of the proposed Centralized Advising and Decentralized Pruning (CADP) framework. At the centralized training stage, the agent model uses the $Q$, $K$, $V$ modules, while at the decentralized execution stage, the agent model uses only the $V$ module.
  • Figure 3: Learning curves of our method and baselines on the SMAC scenarios. (Upper) Comparison with the methods under the CTDE and DTDE frameworks. (Lower) Comparison with the methods under the teacher-student CTDE framework. CADP(C) denotes our centralized model, while CADP(D) denotes our decentralized model, which is used for decentralized execution.
  • Figure 4: Learning curves of our CADP method and baselines on the Google Research Football (GRF) scenarios.
  • Figure 5: Ablation study on different coefficients $\alpha$. The left part shows the learning curves for 4M timesteps and the right part shows the average test win rate over the last 0.1M timesteps.
  • ...and 9 more figures
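
The $Q$/$K$/$V$ split described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight matrices, the toy features `H`, and all helper names are hypothetical. In particular, `pruning_penalty` (a cross-entropy between each attention row and a one-hot on the agent itself) is an illustrative stand-in, since the paper's actual pruning loss is not given in this section.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(H, Wq, Wk):
    """Row i holds how strongly agent i attends to every agent j."""
    Q = [matvec(Wq, h) for h in H]
    K = [matvec(Wk, h) for h in H]
    scale = math.sqrt(len(Q[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
            for q in Q]

def mix(A, V):
    """Combine per-agent value vectors V using attention matrix A."""
    n, dim = len(V), len(V[0])
    return [[sum(A[i][j] * V[j][d] for j in range(n)) for d in range(dim)]
            for i in range(n)]

def centralized_forward(H, Wq, Wk, Wv):
    """Training-time pass: uses the Q, K and V modules (advice exchange)."""
    A = attention_weights(H, Wq, Wk)
    V = [matvec(Wv, h) for h in H]
    return mix(A, V), A

def decentralized_forward(h, Wv):
    """Execution-time pass: uses only the V module on the agent's own input."""
    return matvec(Wv, h)

def pruning_penalty(A, eps=1e-12):
    """Hypothetical penalty: zero only when each agent attends solely to itself."""
    return -sum(math.log(max(A[i][i], eps)) for i in range(len(A))) / len(A)

# Toy setup (assumed values): three agents, two-dimensional local features.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[1.0, 0.0], [0.0, 1.0]]
Wv = [[2.0, 0.0], [0.0, 3.0]]

mixed, A = centralized_forward(H, Wq, Wk, Wv)
local = [decentralized_forward(h, Wv) for h in H]
identity = [[1.0 if i == j else 0.0 for j in range(len(H))]
            for i in range(len(H))]

print("attention:", A)
print("penalty before pruning:", pruning_penalty(A))
print("pruned output equals local output:", mix(identity, local) == local)
```

In this sketch, driving the penalty toward zero pushes each attention row toward a one-hot on the agent itself. Once that holds, the centralized pass reduces to the $V$-only pass, so the $Q$ and $K$ modules can be dropped at execution time without changing any agent's output.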