Multi-Agent Guided Policy Search for Non-Cooperative Dynamic Games

Jingqi Li, Gechen Qu, Jason J. Choi, Somayeh Sojoudi, Claire Tomlin

TL;DR

Non-cooperative dynamic games pose challenges for MARL due to non-stationarity and multiple equilibria. The authors introduce Multi-Agent Guided Policy Search (MA-GPS), which injects a model-based prior derived from local LQ approximations as a reward regularizer to stabilize gradient dynamics and guide agents toward an approximate Nash equilibrium. They prove local exponential convergence in infinite-horizon LQ games and extend the approach to nonlinear games by solving short-horizon local LQ games along current trajectories, avoiding the cost of full iLQGames. Empirical results on nonlinear vehicle platooning and a six-player basketball formation show faster convergence and reduced variance compared with state-of-the-art MARL baselines, and the resulting policies are real-time and scalable.
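As a rough illustration of the first step in building a local LQ approximation along a trajectory, the sketch below linearizes a nonlinear dynamics function by central finite differences. It covers only the dynamics; the quadratic approximation of each player's cost is not shown. The helper `linearize` and the unicycle model `unicycle_step` are hypothetical names chosen for this example and are not taken from the paper.

```python
import numpy as np

def linearize(dynamics, x, u, eps=1e-6):
    """Central-difference Jacobians A = df/dx and B = df/du at the point (x, u)."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    n, m = x.size, u.size
    out_dim = np.asarray(dynamics(x, u)).size
    A = np.zeros((out_dim, n))
    B = np.zeros((out_dim, m))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        A[:, i] = (dynamics(x + dx, u) - dynamics(x - dx, u)) / (2.0 * eps)
    for j in range(m):
        du = np.zeros(m)
        du[j] = eps
        B[:, j] = (dynamics(x, u + du) - dynamics(x, u - du)) / (2.0 * eps)
    return A, B

def unicycle_step(x, u, dt=0.1):
    """Hypothetical discrete-time unicycle: state (px, py, heading), input (speed, turn rate)."""
    px, py, th = x
    v, w = u
    return np.array([px + dt * v * np.cos(th),
                     py + dt * v * np.sin(th),
                     th + dt * w])

# Linearize around one point of a (made-up) trajectory.
A, B = linearize(unicycle_step, x=[0.0, 0.0, 0.0], u=[1.0, 0.0])
```

Repeating this at each point of a short window of the current trajectory would give the time-varying pair (A_t, B_t) that a local LQ game solver needs; an automatic-differentiation library could replace the finite differences.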

Abstract

Multi-agent reinforcement learning (MARL) optimizes strategic interactions in non-cooperative dynamic games, where agents have misaligned objectives. However, data-driven methods such as multi-agent policy gradients (MA-PG) often suffer from instability and limit-cycle behaviors. Prior stabilization techniques typically rely on entropy-based exploration, which slows learning and increases variance. We propose a model-based approach that incorporates approximate priors into the reward function as regularization. In linear quadratic (LQ) games, we prove that such priors stabilize policy gradients and guarantee local exponential convergence to an approximate Nash equilibrium. We then extend this idea to infinite-horizon nonlinear games by introducing Multi-agent Guided Policy Search (MA-GPS), which constructs short-horizon local LQ approximations from trajectories of current policies to guide training. Experiments on nonlinear vehicle platooning and a six-player strategic basketball formation show that MA-GPS achieves faster convergence and more stable learning than existing MARL methods.
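To make the idea of a prior entering the reward as regularization concrete, here is a minimal sketch. It assumes the regularizer is a squared L2 distance between an agent's action and the action of a linear guiding policy, weighted by a coefficient `rho`. The exact form used in the paper is not given in this summary, so the penalty, the sign convention `u = -K x`, and the function name `guided_reward` are all assumptions made for illustration.

```python
import numpy as np

def guided_reward(reward, state, action, K_guide, rho):
    """Task reward minus a quadratic penalty for deviating from a guiding policy.

    Assumed form (not confirmed by the source):
        r_guided = r - rho * || action - (-K_guide @ state) ||^2
    """
    guide_action = -K_guide @ state      # linear feedback prior, u = -K x (assumed convention)
    deviation = action - guide_action
    return reward - rho * float(deviation @ deviation)

state = np.array([1.0, 2.0])
K_guide = np.array([[1.0, 0.0]])         # hypothetical guiding gain
on_guide = guided_reward(1.0, state, np.array([-1.0]), K_guide, rho=0.1)
off_guide = guided_reward(1.0, state, np.array([0.0]), K_guide, rho=0.1)
```

An action that matches the guide leaves the reward unchanged, while a deviation is penalized in proportion to `rho`. Setting `rho = 0` recovers the unregularized reward.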

Paper Structure

This paper contains 15 sections, 3 theorems, 26 equations, 3 figures, 1 algorithm.

Key Result

Proposition 1

Under Assumption alg:ma-gps, let $K^*$ be a feedback Nash equilibrium policy, $\check{K}$ an arbitrary stabilizing feedback policy, and $\mathcal{K}$ the set of all stabilizing feedback policies. Then, we have

Figures (3)

  • Figure 1: Policy guidance stabilizes infinite-horizon LQ game dynamics and enables convergence to a close neighborhood of the true Nash equilibrium, even if the guidance is wrong. Each row corresponds to one of $K^{1}$ and $K^{2}$. In each plot, the two axes represent the entries $K^{i,1}$ and $K^{i,2}$ of the corresponding $K^{i}$, and the curves depict their trajectories under policy gradient with different values of $\rho$ and a biased guiding policy $\check{K}$ (potentially derived from an inaccurate dynamics model). A small $\rho$ ($\ge 0.001$) is sufficient to stabilize the policy gradient dynamics near the ground-truth feedback Nash equilibrium $K^{*}$. However, as $\rho$ increases, the bias that the guidance introduces into the converged policies also grows.
  • Figure 2: Three-vehicle platooning experiment. (a) Total reward during GPU training time. Note that the training time includes the computation time of the LQ game solutions used to guide MA-GPS. MA-GPS achieves the highest reward and lowest variance compared to the four other methods. (b) Trajectories of the vehicles. The leader vehicle guides the other vehicles to the center lane. MA-GPS effectively enables merging into a lane, while IPPO’s Car 3 gets stuck within the same GPU computation time.
  • Figure 3: Six-player strategic basketball formation experiment. (a) Total reward during GPU training time. MA-GPS achieves a higher-performing policy more quickly than IPPO, MA-DDPG, and MA-PPO, with lower variance when measured by GPU training time. In contrast, naive L2 policy regularization reduces variance but slows IPPO’s convergence. These results suggest that the local LQ game approximations introduced in the paper's local-guidance section can be computed efficiently, thereby accelerating MARL policy convergence. (b) Trajectories of the players. MA-GPS learns more effective policies than IPPO and MA-PPO within the same GPU computation time, as evidenced by the distinct formation of agents.
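The trade-off described in the Figure 1 caption (a small $\rho$ stabilizes the gradient dynamics, while a larger $\rho$ pulls the converged policy toward a biased guide) can be reproduced qualitatively on a much simpler stand-in. The sketch below uses a two-player bilinear game, not the LQ game from the paper; the function `simulate`, the guide value, the step size, and the quadratic penalty form are all assumptions chosen for illustration.

```python
import numpy as np

def simulate(rho, guide=(0.5, 0.5), steps=5000, lr=0.05):
    """Simultaneous gradient descent on a toy bilinear game with a guidance penalty.

    Player 1 minimizes  x*y + rho * (x - guide_x)^2 over x.
    Player 2 minimizes -x*y + rho * (y - guide_y)^2 over y.
    The unregularized game (rho = 0) has its unique Nash equilibrium at (0, 0).
    Returns the distance of the final iterate from that equilibrium.
    """
    x, y = 1.0, 1.0
    gx, gy = guide
    for _ in range(steps):
        grad_x = y + 2.0 * rho * (x - gx)    # gradient of player 1's cost w.r.t. x
        grad_y = -x + 2.0 * rho * (y - gy)   # gradient of player 2's cost w.r.t. y
        x, y = x - lr * grad_x, y - lr * grad_y
    return float(np.hypot(x, y))

for rho in (0.0, 0.1, 0.5):
    print(f"rho={rho}: distance from Nash = {simulate(rho):.4f}")
```

In this toy, the run with `rho = 0` spirals away from the equilibrium, while `rho = 0.1` and `rho = 0.5` converge to points at distances of roughly 0.14 and 0.5 from it. This mirrors the caption's observation that stability comes at the price of a bias that grows with $\rho$.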

Theorems & Definitions (8)

  • Proposition 1
  • proof
  • Remark 1: On the uniqueness of Nash equilibria
  • Remark 2: Comparing with finite-horizon iLQGames
  • Lemma 1
  • Lemma 2
  • proof
  • proof: Proof of the proposition on L2 stabilization