Multi-Agent Guided Policy Search for Non-Cooperative Dynamic Games

Jingqi Li; Gechen Qu; Jason J. Choi; Somayeh Sojoudi; Claire Tomlin

TL;DR

Non-cooperative dynamic games pose challenges for MARL due to non-stationarity and multiple equilibria. The authors introduce Multi-Agent Guided Policy Search (MA-GPS), which injects a model-based prior derived from local LQ approximations as a reward regularizer to stabilize gradient dynamics and guide agents toward an (approximate) Nash equilibrium. They prove local exponential convergence in infinite-horizon LQ games and extend the approach to nonlinear games by solving short-horizon local LQ games along the trajectories of the current policies, which avoids running a full iLQGames solve. Empirical results on nonlinear vehicle platooning and a six-player basketball formation show faster convergence and reduced variance compared with state-of-the-art MARL baselines, and the resulting policies are reported to be scalable and to run in real time.

Abstract

Multi-agent reinforcement learning (MARL) optimizes strategic interactions in non-cooperative dynamic games, where agents have misaligned objectives. However, data-driven methods such as multi-agent policy gradients (MA-PG) often suffer from instability and limit-cycle behaviors. Prior stabilization techniques typically rely on entropy-based exploration, which slows learning and increases variance. We propose a model-based approach that incorporates approximate priors into the reward function as regularization. In linear quadratic (LQ) games, we prove that such priors stabilize policy gradients and guarantee local exponential convergence to an approximate Nash equilibrium. We then extend this idea to infinite-horizon nonlinear games by introducing Multi-Agent Guided Policy Search (MA-GPS), which constructs short-horizon local LQ approximations from trajectories of current policies to guide training. Experiments on nonlinear vehicle platooning and a six-player strategic basketball formation show that MA-GPS achieves faster convergence and more stable learning than existing MARL methods.
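To make the idea of a prior used as reward regularization concrete, the following is a minimal sketch for a single agent. The abstract does not give the exact form of the regularizer, so everything here is an assumption: the prior is taken to be a linear feedback law with a hypothetical gain `prior_gain`, the penalty is a quadratic deviation from that prior action, and `guided_reward` and `weight` are illustrative names rather than anything from the paper.

```python
import numpy as np


def guided_reward(task_reward, action, state, prior_gain, weight=0.1):
    """Task reward minus a quadratic penalty for deviating from a prior action.

    Assumption: the prior action is the linear feedback -prior_gain @ state,
    as an LQ approximation might supply. The quadratic penalty and its weight
    are illustrative choices, not the paper's stated regularizer.
    """
    prior_action = -prior_gain @ state
    deviation = action - prior_action
    return task_reward - weight * float(deviation @ deviation)


# Hypothetical two-dimensional state and one-dimensional action.
state = np.array([1.0, 0.0])
prior_gain = np.array([[0.5, 0.2]])  # prior action is -0.5 at this state

# An action that matches the prior keeps the task reward unchanged.
on_prior = guided_reward(-1.0, np.array([-0.5]), state, prior_gain)

# An action that departs from the prior is penalized.
off_prior = guided_reward(-1.0, np.array([0.5]), state, prior_gain)
```

In this sketch, an agent trained on `guided_reward` instead of the raw task reward is pulled toward the prior policy, with `weight` setting how strongly. How MA-GPS actually computes and weights its prior is described in the paper itself.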
