Almost Sure Convergence of Networked Policy Gradient over Time-Varying Networks in Markov Potential Games
Sarper Aydin, Ceyhun Eksin
TL;DR
This work addresses Markov potential games (MPGs) with distributed, differentiable policies by introducing networked policy gradient play. Each agent updates its own policy parameters using stochastic gradients estimated from two consecutive episodes and maintains beliefs about the other agents' parameters via consensus over time-varying networks. The authors prove almost sure convergence to a stationary point of the MPG's potential function with rate $O(1/\epsilon^2)$, without requiring bounded policy gradients or initial agreement on the policy parameters. They also show that the method tolerates initial belief errors and that advantage or temporal-difference estimators improve stability and performance. Numerical experiments on a dynamic multi-agent newsvendor problem show that networked policies collect higher rewards than independent policy gradients while converging at a comparable speed, supporting the practical value of the approach in distributed multi-agent settings with evolving communication graphs.
Abstract
We propose networked policy gradient play for solving Markov potential games with continuous and/or discrete state-action pairs. During the game, agents use parametrized and differentiable policies that depend on the current state and the policy parameters of other agents. During training, agents update their policy parameters following stochastic gradients. The gradient estimation uses two consecutive episodes, which yield unbiased estimators of the reward and policy score functions. Agents also maintain estimates of others' parameters through consensus steps on the local estimates they receive over a time-varying communication network. In Markov potential games, there exists a potential value function whose gradients correspond to the gradients of the agents' local value functions. Using this structure, we prove almost sure convergence to a stationary point of the potential value function with rate $O(1/\epsilon^2)$. Compared to previous works, our results do not require bounded policy gradients or initial agreement on the values of individual policy parameters. Numerical experiments on a dynamic multi-agent newsvendor problem verify the convergence of local beliefs and gradients. They further show that networked policy gradient play converges as fast as independent policy gradient updates, while collecting higher rewards.
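The interplay between consensus on parameter beliefs and local gradient steps can be illustrated with a minimal toy sketch. It is not the authors' algorithm: the Markov game is replaced by an assumed quadratic potential $\Phi(\theta) = b^\top\theta - \tfrac{1}{2}\theta^\top A\theta$, the two-episode gradient estimator is replaced by the exact partial derivative plus Gaussian noise, and the time-varying network is an assumed alternating pairwise matching. The names `matching_weights` and `networked_gradient_play`, the step-size schedule, and all constants are hypothetical choices made for illustration.

```python
import numpy as np

def matching_weights(n, t):
    """Doubly stochastic weights for an assumed time-varying graph.

    Even rounds pair agents (0,1), (2,3), ...; odd rounds pair (1,2), ..., (n-1,0).
    No single round is connected, but the union of two rounds is a ring.
    """
    assert n % 2 == 0, "this toy matching needs an even number of agents"
    W = np.eye(n)
    offset = t % 2
    for a in range(offset, n + offset, 2):
        i, j = a % n, (a + 1) % n
        W[i, i] = W[j, j] = 0.5
        W[i, j] = W[j, i] = 0.5
    return W

def networked_gradient_play(A, b, steps=5000, noise_std=0.1, seed=0):
    """Toy gradient ascent on Phi(theta) = b.theta - 0.5 theta.A.theta.

    Agent i owns theta[i] and keeps beliefs[i, j], its estimate of theta[j].
    """
    rng = np.random.default_rng(seed)
    n = len(b)
    idx = np.arange(n)
    theta = rng.normal(size=n)
    beliefs = rng.normal(size=(n, n))   # agents start without agreement
    beliefs[idx, idx] = theta           # each agent knows its own parameter
    for t in range(steps):
        alpha = 0.2 / (t + 1) ** 0.6    # assumed diminishing step size
        # Agent i evaluates dPhi/dtheta_i at its own belief vector, with noise
        # standing in for the error of a sampled gradient estimate.
        grads = b - (A * beliefs).sum(axis=1)
        grads = grads + noise_std * rng.normal(size=n)
        theta = theta + alpha * grads
        # Consensus step: average beliefs with the current neighbor, then
        # overwrite the own entry with the freshly updated parameter.
        beliefs = matching_weights(n, t) @ beliefs
        beliefs[idx, idx] = theta
    return theta, beliefs

if __name__ == "__main__":
    n = 4
    A = 0.75 * np.eye(n) + 0.25 * np.ones((n, n))   # symmetric positive definite
    b = np.array([1.0, 2.0, 3.0, 4.0])
    theta, beliefs = networked_gradient_play(A, b)
    print("gradient norm of potential:", np.linalg.norm(b - A @ theta))
    print("max belief error:", np.abs(beliefs - theta).max())
```

In this sketch the two printed quantities play the role of the gradients and local beliefs whose convergence the experiments track. Because each agent differentiates the shared potential only with respect to its own coordinate, the stacked local updates form a noisy gradient ascent on $\Phi$ once the beliefs agree.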
