Neural Replicator Dynamics

Daniel Hennes, Dustin Morrill, Shayegan Omidshafiei, Remi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, Paavo Parmas, Edgar Duenez-Guzman, Karl Tuyls

TL;DR

Neural Replicator Dynamics (NeuRD) targets nonstationarity in multiagent policy-gradient learning. It bypasses the gradient step through the final softmax and instead applies a replicator-dynamics-inspired update to the logits, giving a no-regret learner whose time-averaged policies converge to approximate Nash equilibria. The paper links NeuRD formally to Hedge and to counterfactual regret minimization (CFR), relates it to natural policy gradient, and extends the replicator dynamics to function approximation. Empirically, NeuRD outperforms softmax policy gradient (SPG) on the imperfect-information benchmarks Kuhn Poker, Leduc Poker, and Goofspiel, and adapts quickly when rewards are nonstationary. The change is a one-line modification that keeps SPG's scalability.

Abstract

Policy gradient and actor-critic algorithms form the basis of many commonly used training techniques in deep reinforcement learning. Using these algorithms in multiagent environments poses problems such as nonstationarity and instability. In this paper, we first demonstrate that standard softmax-based policy gradient can be prone to poor performance in the presence of even the most benign nonstationarity. By contrast, it is known that the replicator dynamics, a well-studied model from evolutionary game theory, eliminates dominated strategies and exhibits convergence of the time-averaged trajectories to interior Nash equilibria in zero-sum games. Thus, using the replicator dynamics as a foundation, we derive an elegant one-line change to policy gradient methods that simply bypasses the gradient step through the softmax, yielding a new algorithm titled Neural Replicator Dynamics (NeuRD). NeuRD reduces to the exponential weights/Hedge algorithm in the single-state all-actions case. Additionally, NeuRD has formal equivalence to softmax counterfactual regret minimization, which guarantees convergence in the sequential tabular case. Importantly, our algorithm provides a straightforward way of extending the replicator dynamics to the function approximation setting. Empirical results show that NeuRD quickly adapts to nonstationarities, outperforming policy gradient significantly in both tabular and function approximation settings, when evaluated on the standard imperfect information benchmarks of Kuhn Poker, Leduc Poker, and Goofspiel.
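The "one-line change" can be made concrete in the single-state, all-actions tabular case. The sketch below is a minimal illustration and not the authors' implementation: the function names, the step size `eta`, the starting logits, and the two-action reward vector are all assumptions chosen for the example. It uses the standard all-actions softmax gradient, in which each logit's update is scaled by the action's probability, and compares it with an update that drops that factor, alongside a plain Hedge update.

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax over a vector of logits."""
    z = np.exp(y - np.max(y))
    return z / z.sum()

def spg_step(y, q, eta):
    """All-actions softmax policy gradient: the advantage is scaled by pi,
    the factor that comes from differentiating through the softmax."""
    pi = softmax(y)
    v = pi @ q                      # expected value under the current policy
    return y + eta * pi * (q - v)

def neurd_step(y, q, eta):
    """NeuRD-style update: same advantage, but the softmax factor pi is skipped."""
    pi = softmax(y)
    v = pi @ q
    return y + eta * (q - v)

def hedge_step(y, q, eta):
    """Hedge / exponential weights: accumulate scaled rewards in the logits."""
    return y + eta * q

# Hypothetical scenario: the policy is already committed to action 0,
# then the rewards flip so that action 1 becomes the better choice.
y0 = np.array([5.0, 0.0])
q = np.array([-1.0, 1.0])
eta, steps = 0.5, 20

y_spg, y_neurd, y_hedge = y0.copy(), y0.copy(), y0.copy()
for _ in range(steps):
    y_spg = spg_step(y_spg, q, eta)
    y_neurd = neurd_step(y_neurd, q, eta)
    y_hedge = hedge_step(y_hedge, q, eta)

pi_spg, pi_neurd, pi_hedge = softmax(y_spg), softmax(y_neurd), softmax(y_hedge)
print("SPG  :", pi_spg)
print("NeuRD:", pi_neurd)
print("Hedge:", pi_hedge)
```

In this toy setting the NeuRD-style and Hedge logits differ only by a constant shift shared by all actions, so their policies coincide, which matches the abstract's statement about the single-state all-actions case. The SPG update is damped by the small probability of the newly better action, so it moves away from the stale action far more slowly.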

Paper Structure

This paper contains 13 sections, 4 theorems, 26 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Corollary 3.0

Consider a sequential decision making task with finite length histories and $N$-agents. Assume that agent $i$ acts according to a softmax tabular policy, $\bm{\pi}_{i,t}(s) \propto \exp(\bm{y}_{i,t}(s))$, where $\bm{y}_{i,t}(s) \in \mathbb{R}^{\left|\mathcal{A}(s)\right|}$ is a vector of logits for where $\beta_{-i}(\bm{\pi}_{t - 1}, s) \left(q^{\bm{\pi}_{t-1}}_i(s, a) - v^{\bm{\pi}_{t-1}}_i(s)\r

Figures (6)

  • Figure 1: The regret of SPG with and without a forfeit action in repeated matching pennies compared to Hedge. The dashed line is a linear least-squares fit.
  • Figure 2: The logit and policy trajectories of SPG and Hedge in all-actions, 100-round, repeated matching pennies with a forfeit action. The vertical lines mark the change in the opponent's policy at 40 rounds. The step size $\eta=0.21$ was optimized in a parameter sweep for SPG with $T=100$.
  • Figure 3: Learning dynamics of (a) RD and (b) SPG in Rock-Paper-Scissors (RPS). Time-averaged trajectories (solid lines) are shown in (c) for RD and in (d) for SPG in the biased-RPS game. In (e) we compare their rate of adaptation, i.e., $\lVert\dot{\bm{\pi}}_\text{RD}\rVert / \lVert\dot{\bm{\pi}}_\text{PG}\rVert$.
  • Figure 4: (a) NashConv of the average NeuRD and SPG policies in biased RPS. (b) NashConv of the sequence-probability average policies of tabular, all-actions, counterfactual value NeuRD and SPG in two-player Leduc Poker.
  • Figure 5: Time-average policy NashConv in nonstationary RPS, with the game phases separated by vertical red dashes.
  • ...and 1 more figure
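Several captions report NashConv, which this excerpt does not define. The sketch below assumes the commonly used definition, the sum over players of how much each could gain by switching to a best response, and evaluates it for a two-player zero-sum matrix game. The function name `nash_conv` and the unbiased Rock-Paper-Scissors payoff matrix are illustrative assumptions; the biased-RPS payoffs used in the paper are not given here.

```python
import numpy as np

def nash_conv(A, pi_row, pi_col):
    """NashConv for a two-player zero-sum matrix game (assumed definition).

    A[i, j] is the row player's payoff; the column player receives -A[i, j].
    Returns the sum of both players' best-response improvements.
    """
    value_row = pi_row @ A @ pi_col          # row player's expected payoff
    best_row = np.max(A @ pi_col)            # row player's best-response payoff
    best_col = np.max(-(A.T @ pi_row))       # column player's best-response payoff
    return (best_row - value_row) + (best_col - (-value_row))

# Standard (unbiased) Rock-Paper-Scissors payoffs for the row player.
RPS = np.array([[0.0, -1.0, 1.0],
                [1.0, 0.0, -1.0],
                [-1.0, 1.0, 0.0]])

uniform = np.ones(3) / 3
rock = np.array([1.0, 0.0, 0.0])

nc_uniform = nash_conv(RPS, uniform, uniform)  # uniform play is the equilibrium
nc_rock = nash_conv(RPS, rock, rock)           # both players could switch and win
print(nc_uniform, nc_rock)
```

Under this definition a value of zero means neither player can improve unilaterally, so lower curves in the NashConv figures correspond to policies closer to a Nash equilibrium.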

Theorems & Definitions (5)

  • Corollary 3.0
  • Corollary 3.1
  • Theorem 1
  • Theorem 2
  • Remark