Measuring Policy Distance for Multi-Agent Reinforcement Learning

Tianyi Hu, Zhiqiang Pu, Xiaolin Ai, Tenghai Qiu, Jianqiang Yi

TL;DR

This work tackles the lack of a general metric for measuring policy differences in multi-agent reinforcement learning by introducing MAPD, a distance measure between agent policies computed through learned conditional latent representations. MAPD defines a distance $d_{ij}$ between policies $\pi_i$ and $\pi_j$ that satisfies key properties, including the triangle inequality, enabling meaningful comparisons across heterogeneous observation spaces. A customizable variant, customized MAPD, extends the approach to quantify differences along user-specified behavioral aspects via ELBO-based learning of customized features. As a practical application, the authors propose MADPS, an online dynamic parameter sharing algorithm that fuses or divides network parameters based on MAPD-derived distances and outperforms other parameter-sharing methods on multi-agent spread tasks and StarCraft II SMAC benchmarks. Overall, MAPD provides a principled, flexible tool for analyzing and leveraging policy diversity to improve MARL performance and scalability.

Abstract

Diversity plays a crucial role in improving the performance of multi-agent reinforcement learning (MARL). Currently, many diversity-based methods have been developed to overcome the drawbacks of excessive parameter sharing in traditional MARL. However, there remains a lack of a general metric to quantify policy differences among agents. Such a metric would not only facilitate the evaluation of diversity evolution in multi-agent systems, but also provide guidance for the design of diversity-based MARL algorithms. In this paper, we propose the multi-agent policy distance (MAPD), a general tool for measuring policy differences in MARL. By learning the conditional representations of agents' decisions, MAPD can compute the policy distance between any pair of agents. Furthermore, we extend MAPD to a customizable version, which can quantify differences among agent policies on specified aspects. Based on the online deployment of MAPD, we design a multi-agent dynamic parameter sharing (MADPS) algorithm as an example of MAPD's applications. Extensive experiments demonstrate that our method is effective in measuring differences in agent policies and specific behavioral tendencies. Moreover, in comparison to other methods of parameter sharing, MADPS exhibits superior performance.
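
To make the idea of a pairwise policy distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes each agent's decision has already been summarized as a diagonal Gaussian latent (mean and standard deviation), and it swaps in the closed-form 2-Wasserstein distance between such Gaussians as a stand-in metric. The function names, the threshold, and the grouping rule are all hypothetical; they only illustrate how a distance matrix $d_{ij}$ could be checked for the triangle inequality and used to decide which agents share parameters.

```python
import numpy as np


def gaussian_w2(mu_a, sigma_a, mu_b, sigma_b):
    """2-Wasserstein distance between two diagonal Gaussians (assumed stand-in metric)."""
    return float(np.sqrt(np.sum((mu_a - mu_b) ** 2) + np.sum((sigma_a - sigma_b) ** 2)))


def policy_distance_matrix(mus, sigmas):
    """Symmetric matrix d[i, j] of pairwise distances between agents' latents."""
    n = len(mus)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = gaussian_w2(mus[i], sigmas[i], mus[j], sigmas[j])
    return d


def satisfies_triangle_inequality(d, tol=1e-9):
    """Check d[i, k] <= d[i, j] + d[j, k] for every triple (i, j, k)."""
    return bool(np.all(d[:, None, :] <= d[:, :, None] + d[None, :, :] + tol))


def sharing_groups(d, fuse_threshold):
    """Hypothetical grouping rule: agents closer than the threshold share one group id."""
    n = d.shape[0]
    group = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if d[i, j] < fuse_threshold:
                old, new = group[j], group[i]
                group = [new if g == old else g for g in group]
    return group


# Four hypothetical agents: 0 and 1 behave alike, 2 and 3 behave alike.
mus = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
sigmas = np.ones_like(mus)

d = policy_distance_matrix(mus, sigmas)
groups = sharing_groups(d, fuse_threshold=1.0)  # -> [0, 0, 2, 2]
```

In this toy setting, agents with the same group id would share one set of network parameters, and the grouping could be recomputed as the distances change during training.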

Paper Structure (16 sections, 8 equations, 6 figures)

This paper contains 16 sections, 8 equations, 6 figures.

Figures (6)

  • Figure 1: The relationship between our work and MARL. Our contributions are highlighted in bold and italicized.
  • Figure 2: Learning the conditional representation of an agent's decision.
  • Figure 3: Policy distance matrices in multi-agent spread tasks. In this scenario, there are 15 agents in 3 colors: agents 1-5, 6-10, and 11-15 are assigned colors No.1, No.2, and No.3, respectively. The agents must move to the specific landmarks that match their color. These matrices show the policy distances for only the first two agents in each colored group.
  • Figure 4: Customized policy distance matrices in multi-agent spread tasks. Panel (a) shows the policy distances between agents in their tendency to move towards a landmark of the same color; panel (b) shows the policy distances in their tendency to move towards the matching landmark.
  • Figure 5: The basic idea of dynamic parameter sharing.
  • ...and 1 more figure