Controlling Behavioral Diversity in Multi-Agent Reinforcement Learning

Matteo Bettini; Ryan Kortvelesy; Amanda Prorok

TL;DR

This work tackles the problem of controlling behavioral diversity in multi-agent reinforcement learning by introducing Diversity Control (DiCo), an architectural constraint that forces a set of heterogeneous agent policies to achieve a target diversity $\mathrm{SND}_\mathrm{des}$ without altering the learning objective. Policies are expressed as a sum of a shared homogeneous component and per-agent deviations, which are dynamically scaled by $\frac{\mathrm{SND}_\mathrm{des}}{\widehat{\mathrm{SND}}}$ to match the desired diversity, with $\widehat{\mathrm{SND}}$ computed from observed deviations and updated softly during training. The authors provide theoretical proofs that this scaling achieves the intended diversity and demonstrate the approach on a Multi-Agent Navigation case study and multiple VMAS tasks, showing improved performance, exploration, and the emergence of novel strategies under different diversity budgets. They also discuss extensions to inequality and analytical diversity control (via FINN), practical considerations for action-bounded spaces, and directions for automatic diversity optimization, highlighting DiCo's potential as a versatile tool for studying and leveraging diversity in MARL.

Abstract

The study of behavioral diversity in Multi-Agent Reinforcement Learning (MARL) is a nascent yet promising field. In this context, the present work deals with the question of how to control the diversity of a multi-agent system. With no existing approaches to control diversity to a set value, current solutions focus on blindly promoting it via intrinsic rewards or additional loss functions, effectively changing the learning objective and lacking a principled measure for it. To address this, we introduce Diversity Control (DiCo), a method able to control diversity to an exact value of a given metric by representing policies as the sum of a parameter-shared component and dynamically scaled per-agent components. By applying constraints directly to the policy architecture, DiCo leaves the learning objective unchanged, enabling its applicability to any actor-critic MARL algorithm. We theoretically prove that DiCo achieves the desired diversity, and we provide several experiments, both in cooperative and competitive tasks, that show how DiCo can be employed as a novel paradigm to increase performance and sample efficiency in MARL. Multimedia results are available on the paper's website: https://sites.google.com/view/dico-marl.
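
The rescaling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes deterministic policies evaluated on a fixed batch of observations, and it uses the mean pairwise Euclidean distance between agents' actions as a stand-in for the paper's SND metric. The function names `pairwise_diversity` and `scaled_policies` are hypothetical, and the soft update of the diversity estimate during training is omitted.

```python
import math
import random


def pairwise_diversity(actions):
    """Mean pairwise L2 distance between agents' actions over all observations.

    actions[i][k] is the action vector of agent i at observation k.
    This is an assumed stand-in for the paper's SND metric.
    """
    n = len(actions)
    num_obs = len(actions[0])
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(num_obs):
                total += math.dist(actions[i][k], actions[j][k])
    num_pairs = n * (n - 1) / 2
    return total / (num_pairs * num_obs)


def scaled_policies(shared, deviations, snd_des):
    """Return shared + (snd_des / snd_hat) * deviation for every agent.

    shared[k] is the homogeneous action at observation k;
    deviations[i][k] is agent i's unscaled deviation at observation k.
    Assumes the unscaled deviations are not all identical (snd_hat > 0).
    """
    snd_hat = pairwise_diversity(deviations)
    scale = snd_des / snd_hat
    return [
        [
            [h + scale * d for h, d in zip(shared[k], dev_i[k])]
            for k in range(len(shared))
        ]
        for dev_i in deviations
    ]


# Toy data: 3 agents, 5 observations, 2-dimensional actions.
random.seed(0)
n_agents, n_obs, act_dim = 3, 5, 2
shared = [[random.uniform(-1, 1) for _ in range(act_dim)] for _ in range(n_obs)]
deviations = [
    [[random.uniform(-1, 1) for _ in range(act_dim)] for _ in range(n_obs)]
    for _ in range(n_agents)
]

snd_des = 0.4
policies = scaled_policies(shared, deviations, snd_des)
measured = pairwise_diversity(policies)
print(measured)  # matches snd_des up to floating-point error
```

Because the shared component cancels in every pairwise difference, the measured diversity depends only on the scaled deviations. Setting `snd_des = 0` in this sketch collapses all agents onto the shared policy.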

Paper Structure (36 sections, 3 theorems, 38 equations, 14 figures, 2 algorithms)

Introduction
Related Works
Background
Problem Formulation
Method
Representing Policies as Heterogeneous Deviations From A Homogeneous Reference
Constraining Heterogeneous Policies via Rescaling
Case Study: Multi-Agent Navigation
Experiments
Dispersion: Tackling Multiple Objectives
Sampling: Boosting Exploration
Tag: Emergent Adversarial Strategies
Discussion and Limitations
Conclusion
Codebase and Links
...and 21 more sections

Key Result

Theorem 5.1

Given a set of multi-agent policies $\{\pi_i\}_{i\in\mathcal{N}}$ of the form presented in eq:scaled_policies and a desired diversity $\mathrm{SND}_\mathrm{des}$, the diversity of the policies $\{\pi_{i}\}_{i\in\mathcal{N}}$ is equal to the desired value: $\mathrm{SND}(\{\pi_{i}\}_{i\in\mathcal{N}}) = \mathrm{SND}_\mathrm{des}$.
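
A brief sketch of why rescaling yields this result, under simplifying assumptions that are not taken from the paper's proof: suppose each policy has the form $\pi_i(o) = \pi_h(o) + s\,\pi_{h,i}(o)$ with $s = \mathrm{SND}_\mathrm{des} / \widehat{\mathrm{SND}}$ held constant across agents and observations, and suppose the metric is the mean pairwise distance between agents' outputs over a set of observations $\mathcal{O}$. The symbols $\pi_h$, $\pi_{h,i}$, $s$, $n = |\mathcal{N}|$, and $\mathcal{O}$ are notation assumed here for illustration; the paper's own metric and proof may differ in detail.

```latex
\begin{aligned}
\mathrm{SND}(\{\pi_i\}_{i\in\mathcal{N}})
  &= \frac{2}{n(n-1)\,|\mathcal{O}|} \sum_{o\in\mathcal{O}} \sum_{i<j}
     \left\| \pi_i(o) - \pi_j(o) \right\| \\
  &= \frac{2}{n(n-1)\,|\mathcal{O}|} \sum_{o\in\mathcal{O}} \sum_{i<j}
     \left\| s\,\bigl(\pi_{h,i}(o) - \pi_{h,j}(o)\bigr) \right\| \\
  &= s \cdot \widehat{\mathrm{SND}}
   = \frac{\mathrm{SND}_\mathrm{des}}{\widehat{\mathrm{SND}}} \cdot \widehat{\mathrm{SND}}
   = \mathrm{SND}_\mathrm{des}.
\end{aligned}
```

The shared term $\pi_h(o)$ cancels in each pairwise difference, and the nonnegative factor $s$ can be pulled out of the norm, which is what makes the measured diversity scale linearly with $s$.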

Figures (14)

Figure 1: DiCo architecture overview. Multi-agent policies are rescaled to match the desired behavioral diversity $\mathrm{SND}_\mathrm{des}$. The scaling factor is computed as the desired diversity divided by the actual diversity of the unscaled policies, which is updated during training. This process is described in \ref{alg:cbd}.
Figure 2: Illustration of the Multi-Agent Navigation case study. Top left: Example task rendering illustrating task components. System diversity is evaluated for each observation in the 2D space and plotted in the background colormap. Top center: Mean instantaneous reward for agents trained with different desired diversities. Top right: $\mathrm{SND}(\left \{ \pi_{i} \right \}_{i \in \mathcal{N}})$ evaluated for agents trained with different desired diversities. Bottom: Renderings of the diversity distribution over the observation space for agents trained with different desired diversities. With a low diversity budget, agents are not able to go to different goals and learn to converge to the midpoint between goals. As the diversity budget increases, agents learn to distribute diversity in the observations where it is most useful and learn more regular diversity landscapes than the unconstrained case. Curves report mean and standard deviation for the IPPO algorithm over 4 training seeds.
Figure 3: Multi-agent tasks from the VMAS simulator analyzed in our experiments.
Figure 4: Results from training agents with different constraints on the Dispersion task. Left: Mean instantaneous reward. Right: Measured diversity $\mathrm{SND}(\left \{ \pi_{i} \right \}_{i \in \mathcal{N}})$. Curves report mean and standard deviation for the MADDPG algorithm over 4 training seeds.
Figure 5: Results from training agents with different constraints on the Sampling task. Left: Mean instantaneous reward. Right: Measured diversity $\mathrm{SND}(\left \{ \pi_{i} \right \}_{i \in \mathcal{N}})$. Curves report mean and standard deviation for the IDDPG algorithm over 3 training seeds.
...and 9 more figures

Theorems & Definitions (7)

Theorem 5.1: Controlling diversity by rescaling agent policies
proof
Theorem 5.1: DiCo with general diversity metric
proof
proof
Theorem 14.1
proof
