Independent RL for Cooperative-Competitive Agents: A Mean-Field Perspective

Muhammad Aneeq uz Zaman, Alec Koppel, Mathieu Laurière, Tamer Başar

TL;DR

The paper advances reinforcement learning for cooperative-competitive (CC) multi-agent systems by formulating a General-Sum LQ Mean-Field Type Game (GS-MFTG) and characterizing its Nash equilibrium (NE) under an invertibility condition. It introduces Multi-player Receding-horizon Natural Policy Gradient (MRNPG), which enables independent, backward-in-time policy updates across time steps, aided by a Hamilton-Jacobi-Isaacs (HJI) framing and a decoupled mean-field/deviation decomposition. The authors prove linear convergence of MRNPG to the NE under a time- and noise-independent diagonal-dominance condition, along with an $\mathcal{O}(1/M)$-Nash bound bridging the mean-field and finite-population models, and they validate the results with numerical experiments. The work provides a data-driven, scalable approach to NE in CC MFGs with practical guarantees, while highlighting avenues for extending beyond the LQ setting.

Abstract

We address in this paper Reinforcement Learning (RL) among agents that are grouped into teams such that there is cooperation within each team but general-sum (non-zero sum) competition across different teams. To develop an RL method that provably achieves a Nash equilibrium, we focus on a linear-quadratic structure. Moreover, to tackle the non-stationarity induced by multi-agent interactions in the finite population setting, we consider the case where the number of agents within each team is infinite, i.e., the mean-field setting. This results in a General-Sum LQ Mean-Field Type Game (GS-MFTG). We characterize the Nash equilibrium (NE) of the GS-MFTG, under a standard invertibility condition. This MFTG NE is then shown to be $O(1/M)$-NE for the finite population game where $M$ is a lower bound on the number of agents in each team. These structural results motivate an algorithm called Multi-player Receding-horizon Natural Policy Gradient (MRNPG), where each team minimizes its cumulative cost independently in a receding-horizon manner. Despite the non-convexity of the problem, we establish that the resulting algorithm converges to a global NE through a novel problem decomposition into sub-problems using backward recursive discrete-time Hamilton-Jacobi-Isaacs (HJI) equations, in which independent natural policy gradient is shown to exhibit linear convergence under time-independent diagonal dominance. The included numerical studies corroborate the theoretical results.
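
The backward-in-time, independent natural policy gradient idea described above can be illustrated with a minimal sketch. This is not the authors' MRNPG: it assumes a small deterministic-gradient (model-based) general-sum LQ game on a single state process, with no mean-field/deviation decomposition and no sampled gradient estimates. The function names, the system matrices, the step size `eta`, and the choice of terminal cost-to-go equal to `Q[i]` are all hypothetical choices made for illustration. At each time step, with the future gains frozen, each player's cost is quadratic in its own gain, so each player can run natural gradient steps on its own gain while treating the others as fixed.

```python
import numpy as np

def stage_natural_gradient(i, K, P_next, A, B, R):
    """Exact natural policy gradient of player i's stage cost w.r.t. K[i].

    Assumes linear feedback u^i = -K[i] x and a known cost-to-go P_next[i]
    for the following time step (illustrative, model-based setting).
    """
    # Dynamics seen by player i when the other players' gains are held fixed.
    A_others = A - sum(B[j] @ K[j] for j in range(len(B)) if j != i)
    H = R[i] + B[i].T @ P_next[i] @ B[i]
    return 2.0 * (H @ K[i] - B[i].T @ P_next[i] @ A_others)

def receding_horizon_npg(A, B, Q, R, T, eta=0.1, iters=300):
    """Sweep t = T-1, ..., 0; at each t all players update independently."""
    N, n = len(B), A.shape[0]
    P_next = [Q[i].copy() for i in range(N)]  # assumed terminal cost-to-go
    gains, residuals = [None] * T, [None] * T
    for t in reversed(range(T)):
        K = [np.zeros((B[i].shape[1], n)) for i in range(N)]
        for _ in range(iters):
            # Simultaneous (independent) natural gradient steps.
            grads = [stage_natural_gradient(i, K, P_next, A, B, R) for i in range(N)]
            K = [K[i] - eta * grads[i] for i in range(N)]
        # Stationarity check: largest remaining gradient entry at this stage.
        residuals[t] = max(
            np.abs(stage_natural_gradient(i, K, P_next, A, B, R)).max()
            for i in range(N)
        )
        # Propagate each player's cost-to-go one step backward.
        A_cl = A - sum(B[i] @ K[i] for i in range(N))
        P_next = [Q[i] + K[i].T @ R[i] @ K[i] + A_cl.T @ P_next[i] @ A_cl
                  for i in range(N)]
        gains[t] = K
    return gains, residuals

# Hypothetical two-player example with weak cross-coupling.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.eye(2), np.eye(2)]
R = [np.eye(1), np.eye(1)]
gains, residuals = receding_horizon_npg(A, B, Q, R, T=3)
print("max stage residual:", max(residuals))
```

In this toy instance the players' control channels barely interact, which plays the role of the diagonal-dominance condition: the simultaneous updates form a contraction at every stage. With stronger coupling or a larger `eta`, the same loop is not guaranteed to converge.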

Paper Structure (16 sections, 20 theorems, 217 equations, 5 figures, 3 algorithms)

This paper contains 16 sections, 20 theorems, 217 equations, 5 figures, 3 algorithms.

Key Result

Theorem 2.2

The NE of the MFTG is $\epsilon$-Nash for the finite agent CC game, where $\epsilon = \mathcal{O}(1/\min_{i \in [N]} M_i)$, i.e.

Figures (5)

  • Figure 1: MRNPG Algorithm employs Natural Policy Gradient (NPG) for each agent at timestep $t$, starting from $t=T-1$ and moving in a receding horizon manner (backwards-in-time), to approximate the NE of the game.
  • Figure 1: Numerical Analysis of MRNPG algorithm. (left) comparison with Vanilla Natural Policy Gradient (NPG) and Exact-MRNPG, (center) performance with respect to different values of learning rate $\eta^i_k$, and (right) mini-batch size $N_b$.
  • Figure 1: (a) MRNPG Algorithm converges to the NE using Natural Policy Gradient (NPG) starting from $t=T-1$ and moving in a receding horizon manner (backwards-in-time). (b) Cost functions ($C^i$ and $\tilde{C}^i$) of the $i$th agent in a two-shot game ($T=2$). The arrows denote how NPG in the previous timestep $t=1$ converges to the true NE.
  • Figure 1: Comparison between MRPG $(N_b = 2,000)$, SP-MRPG $(N_b = 2,000)$ and SP-MRPG $(N_b = 10,000)$.
  • Figure 2: Error convergence for exact versions of MRNPG, MF-MARL and MADPG for $N=\{3,6,9\}$, $T=3$ and $m=p=2$.

Theorems & Definitions (30)

  • Definition 2.1
  • Theorem 2.2
  • Theorem 2.3
  • Lemma 3.1
  • Lemma 4.1
  • Lemma 4.2: malik2019derivative
  • Lemma 4.4
  • Theorem 4.5
  • Theorem 4.6
  • Lemma B.1
  • ...and 20 more