Measuring Policy Distance for Multi-Agent Reinforcement Learning

Tianyi Hu; Zhiqiang Pu; Xiaolin Ai; Tenghai Qiu; Jianqiang Yi

TL;DR

This work tackles the lack of a general metric for measuring policy differences in multi-agent reinforcement learning by introducing MAPD, a distance measure between agent policies computed through learned conditional latent representations. MAPD defines a distance $d_{ij}$ between policies $\pi_i$ and $\pi_j$ and enforces key properties, including the triangle inequality, enabling meaningful comparisons across heterogeneous observation spaces. A customizable variant, customized MAPD, extends the approach to quantify differences along user-specified behavioral aspects via ELBO-based learning of customized features. As a practical application, the authors propose MADPS, an online dynamic parameter sharing algorithm that fuses or divides network parameters based on MAPD-derived distances and outperforms other parameter-sharing methods on multi-agent spread tasks and StarCraft II SMAC benchmarks. Overall, MAPD provides a principled, flexible tool for analyzing and leveraging policy diversity to improve MARL performance and scalability.

Abstract

Diversity plays a crucial role in improving the performance of multi-agent reinforcement learning (MARL). Many diversity-based methods have been developed to overcome the drawbacks of excessive parameter sharing in traditional MARL. However, there remains a lack of a general metric to quantify policy differences among agents. Such a metric would not only facilitate the evaluation of how diversity evolves in multi-agent systems, but also provide guidance for the design of diversity-based MARL algorithms. In this paper, we propose the multi-agent policy distance (MAPD), a general tool for measuring policy differences in MARL. By learning the conditional representations of agents' decisions, MAPD can compute the policy distance between any pair of agents. Furthermore, we extend MAPD to a customizable version, which can quantify differences among agent policies on specified aspects. Based on the online deployment of MAPD, we design a multi-agent dynamic parameter sharing (MADPS) algorithm as an example of MAPD's applications. Extensive experiments demonstrate that our method is effective in measuring differences in agent policies and specific behavioral tendencies. Moreover, in comparison to other methods of parameter sharing, MADPS exhibits superior performance.
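To make the idea of a pairwise policy distance more concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a precomputed distance matrix (here built from made-up scalar "policy embeddings"), checks the metric properties a policy distance is expected to satisfy, and then groups agents whose distance falls below a threshold. The function names `is_metric` and `group_by_distance`, the threshold value, and the connected-components grouping rule are all hypothetical illustrations; the paper's actual fuse/divide rule in MADPS may differ.

```python
import numpy as np


def is_metric(d, tol=1e-9):
    """Check symmetry, zero diagonal, non-negativity and the triangle inequality."""
    d = np.asarray(d, dtype=float)
    if not np.allclose(d, d.T, atol=tol):
        return False
    if not np.allclose(np.diag(d), 0.0, atol=tol):
        return False
    if (d < -tol).any():
        return False
    # via[i, k, j] = d[i, k] + d[k, j]; the triangle inequality requires
    # d[i, j] <= via[i, k, j] for every intermediate agent k.
    via = d[:, :, None] + d[None, :, :]
    return bool((d <= via.min(axis=1) + tol).all())


def group_by_distance(d, fuse_threshold):
    """Group agents by linking any pair closer than fuse_threshold.

    Assumption: groups are connected components of the "close enough" graph,
    computed with a small union-find. This is an illustrative rule only.
    """
    n = len(d)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] < fuse_threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical 1-D "policy embeddings" for four agents (made-up numbers).
embeddings = np.array([0.0, 0.1, 2.0, 2.1])
distances = np.abs(embeddings[:, None] - embeddings[None, :])

metric_ok = is_metric(distances)
groups = group_by_distance(distances, fuse_threshold=0.5)
```

With these made-up values, agents 0 and 1 end up in one group and agents 2 and 3 in another, which is the kind of partition a distance-driven parameter-sharing scheme could use to decide which agents share network parameters.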

Paper Structure (16 sections, 8 equations, 6 figures)

Introduction
Background
Measuring Policy Distance between Agents
Analysis
Learning the conditional representations of agents' decisions
Multi-Agent Policy Distance
Case Study of MAPD
Measuring Customized Policy Distance
Learning customized representations
Case Study of Customized MAPD
Dynamic Parameter Sharing: an application of MAPD for MARL
Multi Agent Dynamic Parameter Sharing
Experiments
Experiment Settings
Superior Performance of Our Method
...and 1 more sections

Figures (6)

Figure 1: The relationship between our work and MARL. Our contributions are highlighted in bold and italicized.
Figure 2: Learning the conditional representation of an agent's decision.
Figure 3: Policy distance matrices in multi-agent spread tasks. In this scenario, there are 15 agents with 3 colors: agents numbered 1-5, 6-10, and 11-15 are assigned colors No. 1, No. 2, and No. 3, respectively. The agents must move to the specific landmarks that match their color. These matrices show the policy distances for only the first two agents in each colored group.
Figure 4: Customized policy distance matrices in multi-agent spread tasks. Figure (a) shows the policy distances between agents in their tendency to move towards a same-colored landmark, and figure (b) shows the policy distances in their tendency to move towards the matching landmark.
Figure 5: The basic idea of dynamic parameter sharing.
...and 1 more figures
