Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Reinforcement Learning

Alakh Sharma, Gaurish Trivedi, Kartikey Singh Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

TL;DR

Generative Evolutionary Meta-Solver (GEMS) is a surrogate-free framework that replaces explicit policy populations with a compact set of latent anchors and a single amortized generator, enabling scalable multi-agent reinforcement learning across multiple domains.

Abstract

Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, such as Policy-Space Response Oracles (PSRO), require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of two-player and multi-player games, including the Deceptive Messages Game, Kuhn Poker, and the Multi-Particle environment. We find that GEMS is up to ~$\mathbf{6\times}$ faster and uses $\mathbf{1.3\times}$ less memory than PSRO, while also achieving higher rewards. These results demonstrate that GEMS retains the game-theoretic guarantees of PSRO while overcoming its fundamental inefficiencies, enabling scalable multi-agent learning in multiple domains.

Paper Structure

This paper contains 97 sections, 17 theorems, 80 equations, 39 figures, 5 tables, 1 algorithm.

Key Result

Lemma 3.1

With rewards in $[0,1]$, the estimators are unbiased: $\mathbb{E}[\hat{v}_{t,i}] = (M\sigma_t)_i$ and $\mathbb{E}[\hat{\bar{r}}_t] = \sigma_t^\top M \sigma_t$. Moreover, for any $\delta \in (0,1)$, with probability at least $1-\delta$, each $\hat{v}_{t,i}$ satisfies an empirical-Bernstein deviation bound around $(M\sigma_t)_i$ whose width depends on $\widehat{\mathrm{Var}}_{t,i}$, the empirical variance of $\{Y_{i,s,\ell}\}$.
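To make the lemma concrete, the following is a minimal Python sketch of the two Monte Carlo estimators and of an empirical-Bernstein radius. It is illustrative only and rests on assumptions: the payoff matrix `M`, the Bernoulli `rollout` model, and all function names are hypothetical, and the radius uses the one-sided Maurer-Pontil form of the empirical-Bernstein inequality, whose constants may differ from those in the paper's bound.

```python
import math
import random


def sample_index(sigma, rng):
    """Draw one index from the categorical distribution sigma."""
    return rng.choices(range(len(sigma)), weights=sigma, k=1)[0]


def rollout(M, i, j, rng):
    """Hypothetical rollout: a Bernoulli reward in {0, 1} with mean M[i][j]."""
    return 1.0 if rng.random() < M[i][j] else 0.0


def estimate_policy_value(M, sigma, i, n, rng):
    """Estimate (M sigma)_i: policy i against opponents sampled from sigma."""
    ys = [rollout(M, i, sample_index(sigma, rng), rng) for _ in range(n)]
    return sum(ys) / n, ys


def estimate_mixture_value(M, sigma, n, rng):
    """Estimate sigma^T M sigma: both players are sampled from sigma."""
    ys = [
        rollout(M, sample_index(sigma, rng), sample_index(sigma, rng), rng)
        for _ in range(n)
    ]
    return sum(ys) / n


def eb_radius(ys, delta):
    """One-sided empirical-Bernstein radius (Maurer-Pontil form, assumed here)
    for samples in [0, 1], using the unbiased sample variance."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / (n - 1)
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))


if __name__ == "__main__":
    # Toy 3x3 meta-game with payoffs in [0, 1] (made-up numbers).
    M = [[0.5, 0.8, 0.2], [0.2, 0.5, 0.9], [0.8, 0.1, 0.5]]
    sigma = [0.2, 0.3, 0.5]
    rng = random.Random(0)
    v_hat, ys = estimate_policy_value(M, sigma, 0, 20000, rng)
    print("v_hat[0]:", v_hat, "EB radius:", eb_radius(ys, 0.05))
    print("r_bar_hat:", estimate_mixture_value(M, sigma, 20000, rng))
```

Because each sample draws its opponent from $\sigma_t$ and then a reward with mean $M_{ij}$, averaging recovers $(M\sigma_t)_i$ in expectation without ever filling in the full payoff matrix. The radius shrinks faster for low-variance policies, which is what makes a variance-aware bound useful for an optimistic oracle.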

Figures (39)

  • Figure 1: Tournament analogy for policy populations. (Left) PSRO: explicit $k \times k$ payoff matrix with all pairwise matchups. (Right) GEMS: compact anchor set of latent policies, with a single generator producing diverse strategies on demand.
  • Figure 2: At each iteration $t$, Monte Carlo rollouts evaluate the current policy mixture under a fixed meta-strategy $\sigma_t$, producing estimated meta-values $\hat{v}_t$ (policy-to-mixture) and $\hat{\bar{r}}_t$ (mixture self-play). An optimistic meta-solver updates the mixture via OMWU using the hint $m_t = 2\hat{v}_t - \hat{v}_{t-1}$. An EB-UCB oracle then selects a new latent anchor $z_t^*$ from the candidate set, which is incorporated through amortized generator training with a trust-region objective (ABR-TR). The anchor set $Z_t$ is expanded accordingly, the generator induces updated policies $\pi_\varphi$, and the iteration advances to $t + 1$. Green ellipses denote temporal iteration boundaries rather than algorithmic operations.
  • Figure 3: Performance in the Deceptive Messages Game. Top Left: the GEMS Sender's ability to deceive converges to zero. Top Right: the GEMS Receiver's performance converges to the optimal reward of 0.8, outperforming all PSRO-based baselines.
  • Figure 4: Equilibrium finding in Kuhn Poker over 5 seeds (0-4). GEMS rapidly converges to a significantly lower exploitability than strong PSRO baselines and NeuPL (Left), while demonstrating efficiency in cumulative training time (Right).
  • Figure 5: Emergent agent trajectories in the multi-agent tag environment. Top row: GEMS. Bottom row: PSRO. Columns show uniformly sampled frames from a single rollout (frames 0, 10, 20, 30, 40, 50 of the 50-frame GIF). This figure qualitatively compares the strategies learned by GEMS and classical PSRO. The top row shows that adversaries (red circles) trained with GEMS learn sophisticated, coordinated strategies like flanking and cornering to effectively trap the evader (green dot). In contrast, the bottom row shows that PSRO-trained agents adopt a less effective "herding" behavior, pursuing the target in a single, uncoordinated group. This clear difference in strategic complexity is consistent with the superior performance and higher returns achieved by GEMS, as reflected in the quantitative results.
  • ...and 34 more figures
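
The meta-solver step described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes the hint $m_t = 2\hat{v}_t - \hat{v}_{t-1}$ enters the exponent of a multiplicative-weights update directly (a common one-step form of OMWU), that higher values are better, and that the step size `eta` and the function name `omwu_step` are hypothetical.

```python
import math


def omwu_step(sigma, v_hat, v_hat_prev, eta):
    """One optimistic multiplicative-weights update of the meta-strategy.

    sigma      -- current mixture over policies (non-negative, sums to 1)
    v_hat      -- current estimated policy-to-mixture values
    v_hat_prev -- estimates from the previous iteration
    eta        -- step size (assumed hyperparameter)
    """
    # Optimistic hint from the caption: m_t = 2 * v_hat_t - v_hat_{t-1}.
    hint = [2.0 * a - b for a, b in zip(v_hat, v_hat_prev)]
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the normalization below.
    top = max(hint)
    weights = [s * math.exp(eta * (h - top)) for s, h in zip(sigma, hint)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    sigma = [1.0 / 3.0] * 3
    new_sigma = omwu_step(sigma, [1.0, 0.0, 0.0], [0.0, 0.0, 0.0], eta=1.0)
    print(new_sigma)  # mass shifts toward the first policy
```

When the estimates do not change between iterations, the hint reduces to $\hat{v}_t$ and the step coincides with a plain multiplicative-weights update; the extra $\hat{v}_t - \hat{v}_{t-1}$ term only matters when values are moving.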

Theorems & Definitions (27)

  • Lemma 3.1: Unbiasedness and Empirical-Bernstein Concentration
  • Proof: Proof sketch.
  • Proposition 3.2: External Regret of OMWU under Unbiased Noise
  • Theorem 3.3: Instance-Dependent Oracle Regret
  • Proof: Proof sketch.
  • Proposition 3.4: Exploitability decomposition
  • Theorem 3.5: Finite-Population Exploitability Bound
  • Theorem B.1: Empirical-Bernstein Inequality
  • Lemma B.2: Unbiasedness and Empirical-Bernstein Concentration
  • Theorem D.1: Instance-Dependent Oracle Regret
  • ...and 17 more