PPS-QMIX: Periodically Parameter Sharing for Accelerating Convergence of Multi-Agent Reinforcement Learning

Ke Zhang, DanDan Zhu, Qiuhan Xu, Hao Zhou, Ce Zheng

TL;DR

The paper addresses slow convergence in multi-agent reinforcement learning caused by distribution drift among agents. It introduces three periodic parameter sharing variants (A-PPS, RS-PPS, and PP-PPS) that periodically share part or all of the QMIX value network across agents, leveraging ideas from federated learning to mitigate non-IID exploration. Empirical results on the StarCraft Multi-Agent Challenge (SMAC) show average performance gains of 10–30% and enable solving tasks that QMIX cannot, with RS-PPS delivering the strongest results in high-dimensional scenarios. The proposed methods are compatible with existing value-function factorization approaches and offer a practical pathway to faster, more robust MARL training without sharing raw trajectories.

Abstract

Training in multi-agent reinforcement learning (MARL) is a time-consuming process because of the distribution shift of each agent. One drawback is that each agent's strategy in MARL is learned independently even though the agents are in fact cooperating. Thus, a vital issue in MARL is how to accelerate the training process efficiently. To address this problem, current research has leveraged a centralized function (CF) across multiple agents to learn each agent's contribution to the team reward. However, CF-based methods introduce joint error from other agents into the estimation of the value network. Inspired by federated learning, we therefore propose three simple, novel mechanisms to accelerate MARL training: Average Periodically Parameter Sharing (A-PPS), Reward-Scalability Periodically Parameter Sharing (RS-PPS), and Partial Personalized Periodically Parameter Sharing (PP-PPS). Agents share their Q-value networks periodically during training. Agents with the same identity use their collected rewards as scaling weights and update only part of the neural network during each period, so that different parameters are shared. We apply our approaches to the classical MARL method QMIX and evaluate them on various tasks in the StarCraft Multi-Agent Challenge (SMAC) environment. Numerical experiments show substantial gains, with an average improvement of 10%-30%, and our methods win tasks that QMIX cannot. Our code can be downloaded from https://github.com/ColaZhang22/PPS-QMIX

Paper Structure (13 sections, 13 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Distribution drift in decentralized training for agents. Because of distribution drift during exploration, each agent can reach a local optimum but not the global optimum. To address the non-IID agent experience trajectories, each agent explores its own environment and transmits its local model to be aggregated into a generalized model.
  • Figure 2: Architecture of parameter-sharing QMIX. The middle of the figure shows the overall architecture. Each agent has its own value network, shown as the brown component on the right. The purple components are aggregation blocks. The value network parameters of each agent are shared periodically using the three approaches to enrich each agent's experience.
  • Figure 3: Three periodically parameter sharing approaches in QMIX. (a) Average Periodically Parameter Sharing (A-PPS) applies an equal weight to each agent's value network; (b) Reward-Scalability Periodically Parameter Sharing (RS-PPS) introduces a reward buffer that stores the rewards acquired during exploration and uses them as aggregation weights; (c) Partial Personalized Periodically Parameter Sharing (PP-PPS) divides each agent's value network into two parts, a personalized representation and a value function, keeps the personalized representation unchanged, and aggregates only the value function part.
  • Figure 4: Comparative performance (QMIX) in the SC2 environment. (a), (b), and (c) are asymmetric environments; (d) and (e) are symmetric environments. Our method outperforms the conventional approach on tasks (a), (b), and (c).
  • Figure 5: Comparative performance (VDN) in the SC2 environment. (a), (b), and (c) are asymmetric environments; (d) and (e) are symmetric environments. Our method outperforms the conventional approach on tasks (a), (b), and (c).
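
The three aggregation rules described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it only shows one plausible reading of the caption. All names here (`Params`, `a_pps`, `rs_pps`, `pp_pps`, the layer keys `"repr"` and `"value"`) are hypothetical. The RS-PPS weighting, proportional to the sum of each agent's buffered rewards and assuming non-negative rewards, is an assumption, since the exact formula is not given in this summary.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: layer name -> flattened list of weights.
Params = Dict[str, List[float]]


def _aggregate(agents: Sequence[Params], weights: Sequence[float],
               keys: Sequence[str]) -> List[Params]:
    """Return new per-agent parameters in which the layers named in `keys`
    are replaced by the weighted average across agents; other layers are copied."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    merged = {
        k: [sum(n * p[k][i] for n, p in zip(norm, agents))
            for i in range(len(agents[0][k]))]
        for k in keys
    }
    return [{k: list(merged[k]) if k in merged else list(v)
             for k, v in p.items()} for p in agents]


def a_pps(agents: Sequence[Params]) -> List[Params]:
    """A-PPS: equal weight for every agent, all layers shared."""
    return _aggregate(agents, [1.0] * len(agents), list(agents[0]))


def rs_pps(agents: Sequence[Params],
           reward_buffers: Sequence[Sequence[float]]) -> List[Params]:
    """RS-PPS: aggregation weights derived from each agent's reward buffer.
    Assumption: weight = sum of buffered rewards, rewards non-negative;
    falls back to equal weights if no reward has been collected."""
    weights = [float(sum(buf)) for buf in reward_buffers]
    if any(w < 0 for w in weights):
        raise ValueError("this sketch assumes non-negative rewards")
    if sum(weights) == 0:
        weights = [1.0] * len(agents)
    return _aggregate(agents, weights, list(agents[0]))


def pp_pps(agents: Sequence[Params],
           shared_keys: Sequence[str]) -> List[Params]:
    """PP-PPS: only the value-function layers in `shared_keys` are averaged;
    the personalized representation layers stay untouched."""
    return _aggregate(agents, [1.0] * len(agents), shared_keys)


# Two toy agents, each with a "repr" (personalized) and a "value" layer.
agents = [
    {"repr": [1.0, 2.0], "value": [0.0, 4.0]},
    {"repr": [3.0, 6.0], "value": [2.0, 0.0]},
]
avg = a_pps(agents)
scaled = rs_pps(agents, reward_buffers=[[1.0], [3.0]])
partial = pp_pps(agents, shared_keys=["value"])
```

In a training loop, one of these functions would be called once per sharing period (every fixed number of training steps, for example) and the returned parameters loaded back into each agent's value network; the period length is a hyperparameter not specified in this summary.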