Addressing Rotational Learning Dynamics in Multi-Agent Reinforcement Learning

Baraah A. M. Sidahmed, Tatjana Chavdarova

TL;DR

This work identifies rotational learning dynamics as a core driver of instability and reproducibility problems in centralized-training, decentralized-execution (CTDE) MARL. It reframes MARL as a Variational Inequality (VI) problem with operator $F$ and adopts VI solvers, notably nested Lookahead VI (LA-VI) and Extragradient (EG), to stabilize joint policy and value updates. The authors introduce LA-MARL and (LA-)EG-MARL, provide convergence guarantees under monotone VI assumptions, and demonstrate improvements on zero-sum games and MPE benchmarks. The findings show that VI-based optimization yields stronger convergence to equilibrium and better coordination, suggesting a practical and scalable path to more robust MARL systems.
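
For readers unfamiliar with the VI framing, the block below writes out the textbook form of the problem. The symbols $\mathcal{Z}$, $z^\star$, $f_i$, and $z_i$ are assumptions chosen for illustration and are not taken from the summary above; the paper's exact notation may differ.

```latex
% Standard variational inequality: find a point z* in the feasible set Z such that
\[
  \text{find } z^\star \in \mathcal{Z} \quad \text{s.t.} \quad
  \langle F(z^\star),\, z - z^\star \rangle \ge 0
  \qquad \forall z \in \mathcal{Z}.
\]
% For a two-player game in which player i minimizes a loss f_i(z_1, z_2)
% over its own parameters z_i, a common choice of operator stacks the
% players' own gradients:
\[
  F(z) =
  \begin{pmatrix}
    \nabla_{z_1} f_1(z_1, z_2) \\
    \nabla_{z_2} f_2(z_1, z_2)
  \end{pmatrix},
  \qquad z = (z_1, z_2).
\]
```

When $\mathcal{Z}$ is the whole space, the condition reduces to $F(z^\star) = 0$, that is, a joint stationary point of all players' objectives.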

Abstract

Multi-agent reinforcement learning (MARL) has emerged as a powerful paradigm for solving complex problems through agents' cooperation and competition, finding widespread applications across domains. Despite its success, MARL faces a reproducibility crisis. We show that, in part, this issue is related to the rotational optimization dynamics arising from competing agents' objectives, which require methods beyond standard optimization algorithms. We reframe MARL approaches using Variational Inequalities (VIs), offering a unified framework to address such issues. Leveraging optimization techniques designed for VIs, we propose a general approach for integrating gradient-based VI methods capable of handling rotational dynamics into existing MARL algorithms. Empirical results demonstrate significant performance improvements across benchmarks. In the zero-sum games Rock-paper-scissors and Matching pennies, VI methods achieve better convergence to equilibrium strategies, and in the Multi-Agent Particle Environment Predator-prey, they also enhance team coordination. These results underscore the transformative potential of advanced optimization techniques in MARL.
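
To make "rotational dynamics" concrete, here is a minimal sketch on the classic bilinear game $\min_x \max_y xy$, whose operator is $F(x, y) = (y, -x)$. This toy problem, the step sizes, and all function names are assumptions for illustration; the code is not the paper's LA-MARL or EG-MARL implementation, and the Lookahead variant shown is the single-level form rather than the nested LA-VI. Plain gradient descent spirals away from the equilibrium $(0, 0)$, while Extragradient and Lookahead contract toward it.

```python
import math

def F(z):
    # Operator of the bilinear game min_x max_y x*y:
    # F(x, y) = (df/dx, -df/dy) = (y, -x). Its vector field is a pure rotation.
    x, y = z
    return (y, -x)

def gd_step(z, lr):
    # Simultaneous gradient descent (descent on x, ascent on y).
    g = F(z)
    return (z[0] - lr * g[0], z[1] - lr * g[1])

def eg_step(z, lr):
    # Extragradient: extrapolate first, then update z using the
    # operator evaluated at the extrapolated point.
    z_half = gd_step(z, lr)
    g = F(z_half)
    return (z[0] - lr * g[0], z[1] - lr * g[1])

def lookahead_step(z, lr, k, alpha):
    # Single-level Lookahead: run k fast GD steps from the slow point z,
    # then move z a fraction alpha toward the resulting fast point.
    fast = z
    for _ in range(k):
        fast = gd_step(fast, lr)
    return (z[0] + alpha * (fast[0] - z[0]), z[1] + alpha * (fast[1] - z[1]))

def final_distance(step, z0, n_steps, **kwargs):
    # Distance to the equilibrium (0, 0) after n_steps applications of `step`.
    z = z0
    for _ in range(n_steps):
        z = step(z, **kwargs)
    return math.hypot(*z)

z0 = (1.0, 1.0)
start = math.hypot(*z0)
gd_dist = final_distance(gd_step, z0, 500, lr=0.1)
eg_dist = final_distance(eg_step, z0, 500, lr=0.1)
la_dist = final_distance(lookahead_step, z0, 100, lr=0.1, k=5, alpha=0.5)

print(f"start={start:.3f} GD={gd_dist:.3f} EG={eg_dist:.3f} LA={la_dist:.3f}")
```

Each GD step multiplies the squared distance to the equilibrium by $1 + \eta^2$, so GD diverges for any step size $\eta > 0$. EG multiplies it by $1 - \eta^2 + \eta^4$, which is below one for $0 < \eta < 1$.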

Paper Structure

This paper contains 56 sections, 23 equations, 11 figures, 3 tables, 6 algorithms.

Figures (11)

  • Figure 1: Comparison between GD-(MADDPG/MATD3) and LA-(MADDPG/MATD3) on Rock-paper-scissors and Matching pennies. $x$-axis: training episodes. $y$-axis: total distance of agents' policies to the equilibrium policy, averaged over $10$ seeds.
  • Figure 2: Comparing the GD, LA, EG, and LA-EG optimization methods on the Rock-paper-scissors game. $x$-axis: training episodes. $y$-axis: squared norm of the learned policy probabilities relative to the equilibrium.
  • Figure 3: Rewards (left) vs. sampled actions from learned policies (right) of (LA-)MADDPG in the Rock-paper-scissors game. The baseline's rewards saturate in the last part of training, but that is not indicative of the agents' performance. Refer to Section \ref{sec:results} for a discussion, and Figure \ref{fig:snapshots_rps_more_details} for more detailed plots and larger action samples.
  • Figure 4: Comparison of different buffer configurations (see Appendix \ref{app:buffer-conf}) and methods on the Rock-paper-scissors game. $x$-axis: training episodes. $y$-axis: $5$-seed average norm between the two players' policies and the equilibrium policy $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})^2$. The dotted line indicates the point at which the buffer begins to change, either through shifting or clearing.
  • Figure 5: Compares different LA-MADDPG configurations to the baseline MADDPG (with Adam) in Rock-paper-scissors with a scheduled learning rate. $x$-axis: training episodes. $y$-axis: $5$-seed average norm between the two players' policies and the equilibrium policy $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})^2$. The dotted lines depict the times when the learning rate was decreased by a factor of $10$.
  • ...and 6 more figures
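
Several captions above plot a distance between the agents' policies and the uniform Rock-paper-scissors equilibrium $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. The sketch below shows one plausible way to compute such a metric. The function name and the choice of the Euclidean norm summed over agents are assumptions; the captions do not specify which norm or aggregation the authors use.

```python
import math

# Uniform mixed strategy, the Rock-paper-scissors equilibrium named in the captions.
UNIFORM = (1 / 3, 1 / 3, 1 / 3)

def distance_to_equilibrium(policies, equilibrium=UNIFORM):
    # Hypothetical metric: sum over agents of the Euclidean distance between
    # each agent's action probabilities and the equilibrium policy.
    return sum(math.dist(policy, equilibrium) for policy in policies)

# Both agents at equilibrium versus both agents playing pure strategies.
at_equilibrium = distance_to_equilibrium([UNIFORM, UNIFORM])
pure_strategies = distance_to_equilibrium([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
print(at_equilibrium, pure_strategies)
```

A metric of this kind reads policy quality directly from the action probabilities, which matters here because, as the Figure 3 caption notes, saturating rewards alone do not indicate how close the agents are to equilibrium.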

Theorems & Definitions (1)

  • Definition 2.1: monotonicity
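
The listing gives only the title of Definition 2.1. For orientation, the textbook definition of a monotone operator is written out below; the paper's exact statement may differ, and the symbols $\mathcal{Z}$, $z$, and $z'$ are assumptions.

```latex
% An operator F : Z -> R^d is monotone if
\[
  \langle F(z) - F(z'),\, z - z' \rangle \ge 0
  \qquad \forall z, z' \in \mathcal{Z}.
\]
% Worked example: the bilinear game min_x max_y xy has F(x, y) = (y, -x), so
\[
  \langle F(z) - F(z'),\, z - z' \rangle
  = (y - y')(x - x') - (x - x')(y - y') = 0 .
\]
```

The bilinear operator is therefore monotone but not strongly monotone. This is the purely rotational regime in which plain gradient descent fails to converge.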