Measuring Policy Distance for Multi-Agent Reinforcement Learning
Tianyi Hu, Zhiqiang Pu, Xiaolin Ai, Tenghai Qiu, Jianqiang Yi
TL;DR
This work addresses the lack of a general metric for measuring policy differences in multi-agent reinforcement learning by introducing MAPD, a distance measure between agent policies computed from learned conditional latent representations. MAPD defines a distance $d_{ij}$ between policies $\pi_i$ and $\pi_j$ that satisfies key metric properties, including the triangle inequality, and enables meaningful comparisons across heterogeneous observation spaces. A customizable variant extends the approach to quantify differences along user-specified behavioral aspects via ELBO-based learning of customized features. As a practical application, the authors propose MADPS, an online dynamic parameter sharing algorithm that fuses or divides network parameters based on MAPD-derived distances and achieves superior performance on multi-agent spread tasks and StarCraft II SMAC benchmarks. Overall, MAPD provides a principled, flexible tool for analyzing and leveraging policy diversity to improve MARL performance and scalability.
Abstract
Diversity plays a crucial role in improving the performance of multi-agent reinforcement learning (MARL). Many diversity-based methods have been developed to overcome the drawbacks of excessive parameter sharing in traditional MARL. However, there remains a lack of a general metric to quantify policy differences among agents. Such a metric would not only facilitate the evaluation of how diversity evolves in multi-agent systems, but also provide guidance for the design of diversity-based MARL algorithms. In this paper, we propose the multi-agent policy distance (MAPD), a general tool for measuring policy differences in MARL. By learning conditional representations of agents' decisions, MAPD can compute the policy distance between any pair of agents. Furthermore, we extend MAPD to a customizable version, which can quantify differences among agent policies on specified aspects. Based on the online deployment of MAPD, we design a multi-agent dynamic parameter sharing (MADPS) algorithm as an example application of MAPD. Extensive experiments demonstrate that our method is effective in measuring differences in agent policies and specific behavioral tendencies. Moreover, compared with other parameter sharing methods, MADPS exhibits superior performance.
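To make the idea of a pairwise policy distance concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each agent's decision on a shared set of sampled observations has already been encoded as a diagonal Gaussian latent (mean and standard deviation), and it assumes the 2-Wasserstein distance as the comparison between latents; the abstract does not specify either choice. The function names `gaussian_w2`, `policy_distance_matrix`, and `sharing_groups`, as well as the threshold-based grouping rule, are hypothetical.

```python
import numpy as np


def gaussian_w2(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians.

    Inputs have shape (..., latent_dim); the distance is taken over the
    last axis. The choice of this distance is an assumption of the sketch.
    """
    mean_term = np.sum((mu_a - mu_b) ** 2, axis=-1)
    std_term = np.sum((sigma_a - sigma_b) ** 2, axis=-1)
    return np.sqrt(mean_term + std_term)


def policy_distance_matrix(mus, sigmas):
    """Pairwise distances d[i, j], averaged over sampled observations.

    mus, sigmas: arrays of shape (n_agents, n_samples, latent_dim),
    assumed to come from some already-trained encoder (hypothetical).
    """
    n_agents = mus.shape[0]
    d = np.zeros((n_agents, n_agents))
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            per_sample = gaussian_w2(mus[i], sigmas[i], mus[j], sigmas[j])
            d[i, j] = d[j, i] = per_sample.mean()
    return d


def sharing_groups(d, fuse_threshold):
    """Hypothetical grouping rule: agents closer than the threshold
    (directly or through a chain of close agents) get the same label."""
    n_agents = d.shape[0]
    labels = list(range(n_agents))
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            if d[i, j] < fuse_threshold:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels


# Toy data: agents 0 and 1 behave identically, agent 2 is shifted.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 4))
mus = np.stack([base, base, base + 3.0])
sigmas = np.ones((3, 5, 4))

d = policy_distance_matrix(mus, sigmas)
groups = sharing_groups(d, fuse_threshold=1.0)
print(d)
print(groups)
```

Because the per-sample distance is a Euclidean norm on the concatenated means and standard deviations, and an average of metrics is itself symmetric and obeys the triangle inequality, the resulting matrix has the properties the TL;DR mentions. The grouping step only illustrates how such distances could drive a decision to fuse or divide parameters; the actual MADPS criteria are not given in this section.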
