HyperMARL: Adaptive Hypernetworks for Multi-Agent RL

Kale-ab Abebe Tessera, Arrasy Rahman, Amos Storkey, Stefano V. Albrecht

TL;DR

HyperMARL introduces agent-conditioned hypernetworks that generate per-agent policy and critic weights, explicitly decoupling agent identity from observations to reduce cross-agent gradient interference in parameter-sharing MARL. This gradient decoupling enables adaptive behaviours, ranging from specialised to homogeneous, without altering learning objectives or requiring manual diversity tuning, and it empirically lowers policy gradient variance while preserving behavioural diversity. Across 22 MARL scenarios with up to 30 agents, HyperMARL performs competitively with six baselines, specialises robustly in heterogeneous tasks, and still handles homogeneous tasks effectively, including in recurrent settings. The results suggest that gradient decoupling via hypernetworks is a principled and scalable route to adaptive MARL, which matters for large-scale multi-agent systems where behavioural diversity is crucial.

Abstract

Adaptive cooperation in multi-agent reinforcement learning (MARL) requires policies to express homogeneous, specialised, or mixed behaviours, yet achieving this adaptivity remains a critical challenge. While parameter sharing (PS) is standard for efficient learning, it notoriously suppresses the behavioural diversity required for specialisation. This failure is largely due to cross-agent gradient interference, a problem we find is surprisingly exacerbated by the common practice of coupling agent IDs with observations. Existing remedies typically add complexity through altered objectives, manual preset diversity levels, or sequential updates -- raising a fundamental question: can shared policies adapt without these intricacies? We propose a solution built on a key insight: an agent-conditioned hypernetwork can generate agent-specific parameters and decouple observation- and agent-conditioned gradients, directly countering the interference from coupling agent IDs with observations. Our resulting method, HyperMARL, avoids the complexities of prior work and empirically reduces policy gradient variance. Across diverse MARL benchmarks (22 scenarios, up to 30 agents), HyperMARL achieves performance competitive with six key baselines while preserving behavioural diversity comparable to non-parameter sharing methods, establishing it as a versatile and principled approach for adaptive MARL. The code is publicly available at https://github.com/KaleabTessera/HyperMARL.
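The core mechanism described in the abstract, an agent-conditioned hypernetwork that emits agent-specific policy parameters, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear hypernetwork, the learnable agent embeddings, the one-hidden-layer target policy, and all sizes and names (`agent_emb`, `hyper_W`, `generate_policy_params`) are hypothetical choices made for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not values from the paper).
N_AGENTS, OBS_DIM, HIDDEN, N_ACTIONS, EMB_DIM = 3, 4, 8, 2, 5

# Assumed: one embedding vector per agent (one-hot IDs would also work).
agent_emb = rng.normal(size=(N_AGENTS, EMB_DIM))

# Hypernetwork (assumed linear here): maps an agent embedding to a flat
# vector holding every weight and bias of a one-hidden-layer policy MLP.
N_TARGET = OBS_DIM * HIDDEN + HIDDEN + HIDDEN * N_ACTIONS + N_ACTIONS
hyper_W = rng.normal(scale=0.1, size=(EMB_DIM, N_TARGET))
hyper_b = np.zeros(N_TARGET)


def generate_policy_params(agent_id):
    """Generate agent-specific policy weights from the agent embedding only."""
    flat = agent_emb[agent_id] @ hyper_W + hyper_b
    i = 0
    W1 = flat[i:i + OBS_DIM * HIDDEN].reshape(OBS_DIM, HIDDEN)
    i += OBS_DIM * HIDDEN
    b1 = flat[i:i + HIDDEN]
    i += HIDDEN
    W2 = flat[i:i + HIDDEN * N_ACTIONS].reshape(HIDDEN, N_ACTIONS)
    i += HIDDEN * N_ACTIONS
    b2 = flat[i:i + N_ACTIONS]
    return W1, b1, W2, b2


def policy(agent_id, obs):
    """Action probabilities: the observation only passes through generated weights."""
    W1, b1, W2, b2 = generate_policy_params(agent_id)
    h = np.tanh(obs @ W1 + b1)
    logits = h @ W2 + b2
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()


# The same observation yields a different action distribution per agent.
obs = rng.normal(size=OBS_DIM)
probs = [policy(a, obs) for a in range(N_AGENTS)]
```

In this sketch the agent identity enters only through the hypernetwork and the observation only through the generated policy, which is the separation the abstract refers to; in a FuPS+ID policy the two would instead be concatenated into a single input to shared weights.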

Paper Structure

This paper contains 59 sections, 1 theorem, 12 equations, 23 figures, 25 tables, 1 algorithm.

Key Result

Theorem 1

A stochastic, shared policy without agent IDs cannot learn the optimal behaviour for the two-player Specialisation Game.
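The intuition behind Theorem 1 can be checked numerically. The sketch below assumes a simplified two-action payoff in which the team receives 1 when the two agents choose different actions and 0 otherwise; this matrix is an illustrative assumption and may not match the exact payoffs in the paper. Without agent IDs (and with no other distinguishing input), both agents sample from the same distribution, so the expected return is 2p(1 - p), which never exceeds 0.5.

```python
import numpy as np

# Assumed payoff for illustration: team reward 1 if the two agents pick
# different actions, 0 if they pick the same one (rows: agent 1, cols: agent 2).
PAYOFF = np.array([[0.0, 1.0],
                   [1.0, 0.0]])


def expected_return(pi_1, pi_2):
    """Expected team reward when the agents act independently with pi_1 and pi_2."""
    return float(pi_1 @ PAYOFF @ pi_2)


# Shared policy without agent IDs: both agents use the same distribution (p, 1 - p).
ps = np.linspace(0.0, 1.0, 101)
shared_returns = [
    expected_return(np.array([p, 1.0 - p]), np.array([p, 1.0 - p])) for p in ps
]
best_shared = max(shared_returns)

# Agent-specific policies can specialise deterministically on different actions.
specialised = expected_return(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(f"best shared return: {best_shared:.3f}, specialised return: {specialised:.3f}")
```

Under this assumed payoff, the best shared policy is the uniform one, and it still leaves half of the achievable return on the table, whereas agent-specific policies reach the optimum.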

Figures (23)

  • Figure 1: HyperMARL Policy Architecture. Common agent-ID conditioned shared MARL policy (FuPS+ID, left) vs HyperMARL (right), which uses an agent-conditioned hypernetwork to generate agent-specific weights and decouples observation- and agent-conditioned gradients.
  • Figure 2: Specialisation and Synchronisation Games. The Specialisation game (left), which encourages distinct actions, and the Synchronisation game (right), where rewards encourage identical actions. Depicted are their two-player payoff matrices (pure Nash equilibria in blue) and $N$-player interaction schematics. While simple in form, these games are challenging MARL benchmarks due to non-stationarity and exponentially scaling observation spaces (temporal version).
  • Figure 3: Multi-agent policy gradient methods in the Specialisation environment. The FuPS+ID (No State) ablation outperforms FuPS+ID and shows near-orthogonal gradients (gradient-conflict subfigure), indicating that decoupling observations from agent IDs is important. HyperMARL (MLP) enables this decoupling while still using state information, and achieves better performance and less gradient conflict than FuPS+ID.
  • Figure 4: Performance and gradient analysis. (a, b) IPPO and MAPPO on Dispersion (20M timesteps), IQM of mean episode return with 95% bootstrap CIs: hypernetworks match NoPS performance, while FuPS variants struggle with specialisation. Interval estimates are given in the appendix on Dispersion dynamics. (c) Actor gradient variance: hypernetworks achieve lower gradient variance than FuPS+ID. (d) Policy diversity (SND with Jensen–Shannon distance): hypernetworks achieve NoPS-level diversity while sharing parameters.
  • Figure 5: 17-agent Humanoid learning dynamics (IQM, 95% CI). HyperMARL, which uses a shared actor architecture, outperforms MAPPO-FuPS (non-overlapping CIs) and matches the performance of methods that use non-shared or sequential actors. This challenging environment is known for high variance in outcomes across methods [JMLR:v25:23-0488].
  • ...and 18 more figures
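The policy-diversity measure named in the Figure 4 caption (SND with Jensen–Shannon distance) can be illustrated for discrete action distributions. The sketch below assumes that diversity is computed as the Jensen–Shannon distance between two agents' action distributions on the same observation, averaged over a batch of observations and over all agent pairs; the exact aggregation used in the paper may differ, and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS divergence), in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing to the KL sum
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))  # clamp tiny negative round-off


def pairwise_diversity(action_probs):
    """Mean JS distance over all agent pairs and observations.

    action_probs has shape (n_agents, n_obs, n_actions): each agent's action
    distribution evaluated on the same batch of observations.
    """
    n_agents, n_obs, _ = action_probs.shape
    pair_means = []
    for i, j in combinations(range(n_agents), 2):
        dists = [js_distance(action_probs[i, k], action_probs[j, k]) for k in range(n_obs)]
        pair_means.append(np.mean(dists))
    return float(np.mean(pair_means))


# Identical policies give zero diversity; fully disjoint policies give one.
homogeneous = np.tile(np.array([0.7, 0.3]), (3, 4, 1))
specialised = np.stack([
    np.tile(np.array([1.0, 0.0]), (4, 1)),
    np.tile(np.array([0.0, 1.0]), (4, 1)),
])
print(pairwise_diversity(homogeneous), pairwise_diversity(specialised))
```

With base-2 logarithms the distance is bounded by 1, so values are comparable across tasks with different numbers of actions.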

Theorems & Definitions (3)

  • Definition 1
  • Theorem 1
  • proof