
Major-Minor Mean Field Multi-Agent Reinforcement Learning

Kai Cui, Christian Fabian, Anam Tahir, Heinz Koeppl

TL;DR

This work extends mean-field control to Major-Minor Mean Field Control (M3FC), enabling scalable cooperative MARL with many minor agents and a few major agents or environment states. It proves that stationary policies suffice and provides dynamic programming principles for the M3FC MDP, and shows that finite-agent MARL can be well-approximated by the limiting M3FC framework with convergence guarantees. The authors introduce M3FMARL, a policy-gradient method (PPO-based) that learns on finite M3FC systems but approximates the true gradient of the limiting M3FC MDP, achieving competitive performance against state-of-the-art MARL baselines. Experiments across five benchmark problems demonstrate stable learning, scalability with the number of minor agents, and effective credit assignment through the mean-field action, indicating practical impact for large-scale cooperative multi-agent settings.

Abstract

Multi-agent reinforcement learning (MARL) remains difficult to scale to many agents. Recent MARL using Mean Field Control (MFC) provides a tractable and rigorous approach to otherwise difficult cooperative MARL. However, the strict MFC assumption of many independent, weakly-interacting agents is too inflexible in practice. We generalize MFC to instead simultaneously model many similar and few complex agents -- as Major-Minor Mean Field Control (M3FC). Theoretically, we give approximation results for finite agent control, and verify the sufficiency of stationary policies for optimality together with a dynamic programming principle. Algorithmically, we propose Major-Minor Mean Field MARL (M3FMARL) for finite agent systems instead of the limiting system. The algorithm is shown to approximate the policy gradient of the underlying M3FC MDP. Finally, we demonstrate its capabilities experimentally in various scenarios. We observe a strong performance in comparison to state-of-the-art policy gradient MARL methods.
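To make the major-minor abstraction concrete, the following is a minimal sketch (not the authors' implementation) of the quantity an M3FC-style policy conditions on: the major state together with the empirical state distribution (mean field) of the minor agents. The finite state space, the distribution `probs`, and all function names are hypothetical choices for illustration. The sketch also shows the law-of-large-numbers effect behind the approximation results: the empirical mean field of i.i.d. minor states approaches the underlying distribution as the number of minor agents grows.

```python
import random
from collections import Counter


def empirical_mean_field(minor_states, num_states):
    """Empirical distribution of minor-agent states over {0, ..., num_states - 1}."""
    counts = Counter(minor_states)
    n = len(minor_states)
    return [counts.get(s, 0) / n for s in range(num_states)]


def sample_minor_states(n_agents, probs, rng):
    """Draw i.i.d. minor states from a hypothetical distribution `probs` (assumption)."""
    return rng.choices(range(len(probs)), weights=probs, k=n_agents)


def l1_distance(p, q):
    """L1 distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q))


def m3fc_observation(major_state, minor_states, num_states):
    """Joint observation (major state, mean field); its size is independent of N."""
    return major_state, empirical_mean_field(minor_states, num_states)


# Illustrative run: the mean field concentrates around `probs` as N increases.
rng = random.Random(0)
probs = [0.2, 0.3, 0.5]  # hypothetical limiting distribution of minor states
for n_agents in (10, 100, 10_000):
    states = sample_minor_states(n_agents, probs, rng)
    _, mf = m3fc_observation(major_state=0, minor_states=states, num_states=3)
    print(n_agents, round(l1_distance(mf, probs), 4))
```

The observation has a fixed dimension regardless of how many minor agents there are, which is what lets a single policy scale with the number of minor agents.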

Paper Structure

This paper contains 50 sections, 17 theorems, 74 equations, 13 figures, 3 tables, 1 algorithm.

Key Result

Theorem 2.1

Under Assumption ass:m3pcont, there exist optimal stationary, deterministic policies $\hat{\pi}$, $\pi^0$ for the M3FC MDP eq:m3fc, obtained by choosing $(\hat{\pi}(x^0, \mu), \pi^0(x^0, \mu))$ from the maximizers of $\operatorname*{arg\,max}_{(h, u^0) \in \mathcal{H}(\mu) \times \mathcal{U}^0} r(x^0 \ldots$

Figures (13)

  • Figure 1: Logistics example: Many drones are modelled as a minor-agent mean field (MF), while the truck and package destinations are modelled by a major agent. (See the Foraging problem in Section \ref{sec:problems}.)
  • Figure 2: Our M3FC-based MARL generalizes both MFC-based MARL and standard single-agent RL within the space of general MARL solutions, reducing the otherwise combinatorial nature of MARL (zhang2021multi) to a tractable but still general setting.
  • Figure 3: The dynamics \ref{eq:m3mdp} as a probabilistic graphical model, with actions in grey (inputs omitted for readability). Diamonds denote deterministic functions. M3FC abstracts the minor agents $i \in [N]$ via the law of large numbers (LLN), considering only their MF as variables in the dotted box.
  • Figure 4: Approximation of intractable $N$-agent control by M3FC (blue path), the solution of which is near-optimal for large $N$.
  • Figure 5: Training curves (mean episode return) of M3FPPO (red), with shaded standard deviation, and maximum (blue) over all three trials (two for Foraging). (a) 2G; (b) Formation; (c) Beach; (d) Foraging; (e) Potential.
  • ...and 8 more figures

Theorems & Definitions (34)

  • Remark 1
  • Remark 2
  • Theorem 2.1
  • Remark 3
  • Theorem 2.2
  • Corollary 2.1
  • Theorem 3.1
  • Theorem 2.1
  • Theorem 2.2
  • Corollary 2.1
  • ...and 24 more