Double Distillation Network for Multi-Agent Reinforcement Learning
Yang Zhou, Siying Wang, Wenyu Chen, Ruoning Zhang, Zhitong Zhao, Zixuan Zhang
TL;DR
The paper tackles non-stationarity and partial observability in cooperative multi-agent reinforcement learning (MARL) under the centralized training with decentralized execution (CTDE) framework. It introduces the Double Distillation Network (DDN), which combines an External Distillation Module (a Global Guiding Network and a Local Policy Network) with an Internal Distillation Module. Together, the two modules align centralized training with decentralized execution and drive exploration through intrinsic rewards derived from global state features. The external pathway uses personalized global information to reduce the gap between the Global Guiding Network and the Local Policy Network, while the internal pathway adds state-informed intrinsic rewards to boost exploration. Experiments on SMAC and Predator-Prey show that DDN improves coordination and training efficiency, and ablations confirm the contributions of personalization, multi-level distillation, and the intrinsic reward mechanism. Overall, DDN offers a practical way to exploit global state information during training while keeping execution decentralized in complex, partially observable MARL settings.
Abstract
Multi-agent reinforcement learning typically employs a centralized training with decentralized execution (CTDE) framework to alleviate the non-stationarity of the environment. However, partial observability during execution can cause agents to accumulate gap errors, which impairs the training of effective collaborative policies. To overcome this challenge, we introduce the Double Distillation Network (DDN), which incorporates two distillation modules aimed at making coordination more robust and facilitating collaboration under constrained information. The external distillation module uses a global guiding network and a local policy network, employing distillation to reconcile the gap between global training and local execution. In addition, the internal distillation module introduces intrinsic rewards, drawn from state information, to enhance the exploration capabilities of agents. Extensive experiments demonstrate that DDN significantly improves performance across multiple scenarios.
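As a rough illustration of the two distillation signals described above, the sketch below computes an external distillation loss and an intrinsic reward on toy values. The abstract does not give the exact loss forms, so everything here is an assumption: the mean squared error between the outputs of the global guiding network (teacher) and the local policy network (student), the use of a squared prediction error on state features as the intrinsic reward, and the names distillation_loss, intrinsic_reward, and scale are all hypothetical and chosen only for illustration.

```python
def distillation_loss(teacher_out, student_out):
    """Mean squared error between teacher and student outputs.

    Assumption: the gap between the global guiding network (teacher)
    and the local policy network (student) is measured with MSE.
    """
    pairs = list(zip(teacher_out, student_out))
    return sum((t - s) ** 2 for t, s in pairs) / len(pairs)


def intrinsic_reward(target_feat, predicted_feat, scale=1.0):
    """Scaled squared error between target and predicted state features.

    Assumption: a large prediction error marks a poorly learned
    (novel) state, so it earns a larger exploration bonus.
    """
    return scale * sum((t - p) ** 2 for t, p in zip(target_feat, predicted_feat))


# Toy values standing in for network outputs (hypothetical numbers).
teacher_q = [1.0, 2.0, 0.5, 0.0]   # outputs computed with global information
student_q = [1.0, 1.0, 0.5, 1.0]   # outputs computed from local observations
ext_loss = distillation_loss(teacher_q, student_q)

target_features = [1.0, 0.0]
novel_prediction = [0.0, 0.0]      # predictor has not yet fit this state
familiar_prediction = [1.0, 0.0]   # predictor matches the target exactly
r_novel = intrinsic_reward(target_features, novel_prediction, scale=0.5)
r_familiar = intrinsic_reward(target_features, familiar_prediction, scale=0.5)

# The intrinsic bonus is added to the environment reward during training.
env_reward = 1.0
total_reward = env_reward + r_novel
```

In this toy setup the student is penalized only where its outputs differ from the teacher's, and the exploration bonus vanishes once the predicted features match the target, which is the qualitative behavior the two modules are described as providing.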
