R3DM: Enabling Role Discovery and Diversity Through Dynamics Models in Multi-agent Reinforcement Learning

Harsh Goel, Mohammad Omama, Behdad Chalaki, Vaishnav Tadiparthi, Ehsan Moradi Pari, Sandeep Chinchali

TL;DR

R3DM addresses a limitation of existing role-based MARL methods, which derive roles only from an agent's past experience: it ties each agent's role to its future behavior through a learned dynamics model, formalized as a mutual-information objective. Role learning is decomposed into two parts: intermediate role embeddings learned via contrastive learning on past trajectories, and intrinsic rewards that promote diverse, role-consistent futures. Both are optimized within a centralized-training, decentralized-execution (CTDE) framework. Results on SMAC and SMACv2 show improved coordination and sample efficiency, with win rates up to 20% higher and higher final win rates on super-hard scenarios, and qualitative results show distinct role differentiation. The approach combines model-based dynamics with information-theoretic role discovery to improve cooperative multi-agent behavior.

Abstract

Multi-agent reinforcement learning (MARL) has achieved significant progress in large-scale traffic control, autonomous vehicles, and robotics. Drawing inspiration from biological systems where roles naturally emerge to enable coordination, role-based MARL methods have been proposed to enhance cooperation learning for complex tasks. However, existing methods exclusively derive roles from an agent's past experience during training, neglecting their influence on its future trajectories. This paper introduces a key insight: an agent's role should shape its future behavior to enable effective coordination. Hence, we propose Role Discovery and Diversity through Dynamics Models (R3DM), a novel role-based MARL framework that learns emergent roles by maximizing the mutual information between agents' roles, observed trajectories, and expected future behaviors. R3DM optimizes the proposed objective through contrastive learning on past trajectories to first derive intermediate roles that shape intrinsic rewards to promote diversity in future behaviors across different roles through a learned dynamics model. Benchmarking on SMAC and SMACv2 environments demonstrates that R3DM outperforms state-of-the-art MARL approaches, improving multi-agent coordination to increase win rates by up to 20%. The code is available at https://github.com/UTAustin-SwarmLab/R3DM.
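The contrastive step mentioned in the abstract can be illustrated with a generic InfoNCE-style loss. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name `info_nce_loss`, the `temperature` value, and the use of hard role assignments (`role_ids`) to pick positives are all hypothetical choices. It shows only the general idea that embeddings sharing a role are pulled together and embeddings from different roles are pushed apart.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce_loss(embeddings, role_ids, temperature=0.1):
    """Mean InfoNCE loss over all (anchor, positive) pairs.

    A positive for an anchor is any other embedding with the same
    (hypothetical) role id; every other embedding acts as a negative.
    """
    z = [normalize(e) for e in embeddings]
    losses = []
    for i, anchor in enumerate(z):
        others = [j for j in range(len(z)) if j != i]
        positives = [j for j in others if role_ids[j] == role_ids[i]]
        if not positives:
            continue  # an anchor without a positive contributes nothing
        logits = {j: dot(anchor, z[j]) / temperature for j in others}
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits.values())
        )
        losses.extend(log_denom - logits[j] for j in positives)
    return sum(losses) / len(losses) if losses else 0.0


# Toy trajectory embeddings for four agents forming two visible groups.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = info_nce_loss(embeddings, role_ids=[0, 0, 1, 1])
misaligned = info_nce_loss(embeddings, role_ids=[0, 1, 0, 1])
print(aligned < misaligned)
```

The loss is small when the role assignment matches the geometry of the embeddings and large when it does not, which is the property a contrastive role encoder is trained to exploit.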

Paper Structure

This paper contains 36 sections, 4 theorems, 32 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Theorem 4.1

Given a set of roles $M$ with cardinality $|M|$, a role $m_i^t \in M$, and a concatenated observation-action trajectory $\tau_i^{t+k}$ which comprises its observation-action history $\tau_i^{t}$ and future trajectory $\tau_i^{t+1:t+k}$ with $k$ steps, if $e_i^{t} = f_{\theta_{e}}\left(\tau_i^{t}\right)$ [...] where $I(\tau_i^{t+1:t+k}; z_i^{t})$ is the MI between the future trajectory $\tau_i^{t+1:t+k}$ and [...]

Figures (5)

  • Figure 1: In a fire-fighting scenario with two drones, standard role-based multi-agent RL methods fail to distribute drones effectively, as roles are inferred from exhibited behavior. By linking roles to future expected behavior via a dynamics model, R3DM achieves better role differentiation and coordination.
  • Figure 2: Test win rate of R3DM compared to baselines on six SMAC maps. R3DM improves sample efficiency and converges to higher win rates on super-hard environments such as 3s5z_vs_3s6z, Corridor, and 6h_vs_8z.
  • Figure 3: Comparison of test win rate and test cumulative reward on the SMACv2 suite of environments. On protoss_5_vs_5 and terran_5_vs_5, R3DM matches the test win rate of the best-performing baseline, ACORM, while achieving higher returns, which suggests that it learns better strategies. On zerg_5_vs_5, protoss_10_vs_11, and zerg_10_vs_11, R3DM outperforms the baselines in test win rate. Solid lines show means and shaded regions show standard deviations across 5 seeds.
  • Figure 4: Qualitative results on the 3s_vs_5z environment with the corresponding role embeddings and clusters. R3DM learns a better strategy than the baseline ACORM: one stalker agent, as shown at timestep 20, learns a distinct role that lures enemy zealots away, so that the main team can defeat a weakened enemy force in the subsequent timesteps. ACORM also learns differentiated roles based on past observations, but the resulting policies are inadequate to win against the enemy team.
  • Figure 5: Ablation study on the 3s5z_vs_3s6z environment evaluating the impact of: 1) imagination horizon for the reward, 2) number of roles, and 3) role optimization without contrastive learning. We observe that (a) shorter imagination horizons ($k=1,2$) outperform longer ones ($k=5,10$) due to reduced compounding errors, (b) a moderate role cardinality ($N_r=3$) achieves faster convergence despite similar final performance across configurations, and (c) full R3DM with both contrastive learning and intrinsic rewards outperforms the partial implementations.
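
The compounding-error effect cited in the Figure 5 caption can be made concrete with a toy example. The sketch below is purely illustrative and is not the paper's dynamics model: the scalar system `true_step`, the slightly mis-specified `learned_step`, and the start state are hypothetical. It shows only that a small one-step model error grows as an imagined rollout is extended from $k=1$ to $k=10$ steps.

```python
def rollout(step_fn, x0, k):
    """Apply a one-step transition function k times starting from x0."""
    x = x0
    for _ in range(k):
        x = step_fn(x)
    return x


def true_step(x):
    """Hypothetical ground-truth scalar dynamics."""
    return 0.9 * x + 1.0


def learned_step(x):
    """Hypothetical learned model with a small error in its coefficient."""
    return 0.95 * x + 1.0


def imagination_error(x0, k):
    """Absolute gap between imagined and true state after k steps."""
    return abs(rollout(learned_step, x0, k) - rollout(true_step, x0, k))


# Same horizons as the ablation: the gap widens as the horizon grows.
errors = {k: imagination_error(5.0, k) for k in (1, 2, 5, 10)}
for k, err in errors.items():
    print(f"k={k:2d}  error={err:.4f}")
```

In this toy system each imagined step feeds an already-wrong state back into an imperfect model, so intrinsic rewards computed from long imagined futures rest on less reliable predictions than those computed from short ones.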

Theorems & Definitions (8)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • proof
  • proof
  • Lemma 1.1
  • proof
  • proof