Almost Sure Convergence of Networked Policy Gradient over Time-Varying Networks in Markov Potential Games

Sarper Aydin, Ceyhun Eksin

TL;DR

This work addresses Markov potential games (MPGs) with distributed, differentiable policies by introducing networked policy gradient play. Agents update their own policy parameters using stochastic gradients estimated from two consecutive episodes and maintain beliefs about others' parameters via consensus over time-varying networks. The authors prove almost sure convergence to a stationary point of the MPG's potential value function at rate $O(1/\epsilon^2)$, without requiring bounded policy gradients or initial agreement on the values of individual policy parameters. They also show that allowing initial belief errors and using advantage or temporal-difference estimators improves stability and performance. Numerical experiments on a dynamic multi-agent newsvendor problem show that networked policies collect higher rewards than independent policy gradients while converging at a comparable speed, which supports the approach in distributed multi-agent settings with evolving communication graphs.

Abstract

We propose networked policy gradient play for solving Markov potential games with continuous and/or discrete state-action pairs. During the game, agents use parametrized and differentiable policies that depend on the current state and the policy parameters of other agents. During training, agents update their policy parameters following stochastic gradients. The gradient estimation involves two consecutive episodes, generating unbiased estimators of reward and policy score functions. In addition, it involves keeping estimates of others' parameters using consensus steps given local estimates received through a time-varying communication network. In Markov potential games, there exists a potential value function among agents with gradients corresponding to the gradients of local value functions. Using this structure, we prove almost sure convergence to a stationary point of the potential value function with rate $O(1/\epsilon^2)$. Compared to previous works, our results do not require bounded policy gradients or initial agreement on the values of individual policy parameters. Numerical experiments on a dynamic multi-agent newsvendor problem verify the convergence of local beliefs and gradients. They further show that networked policy gradient play converges as fast as independent policy gradient updates, while collecting higher rewards.
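The interplay between local gradient steps and consensus over a time-varying network can be illustrated with a small sketch. It is not the paper's algorithm: the function names, the mixing matrices `W_a` and `W_b`, the three-agent setup, and the zero placeholder gradients are all assumptions chosen only to show how belief averaging can drive the agents' estimates of each other's parameters toward the true values when the union of the graphs is connected.

```python
import numpy as np

def networked_round(beliefs, weights, grads, stepsize):
    """One illustrative round: local gradient step, then belief averaging.

    beliefs[i, j] : agent i's estimate of agent j's parameter vector
                    (beliefs[i, i] is agent i's own parameter).
    weights       : row-stochastic mixing matrix of the current graph.
    grads[i]      : agent i's stochastic gradient estimate (assumed given).
    """
    n = beliefs.shape[0]
    own = np.arange(n)
    updated = beliefs.copy()
    # Each agent ascends along its own gradient estimate.
    updated[own, own] += stepsize * grads
    # Each agent averages the estimates held by its current neighbors.
    mixed = np.einsum("ik,kjd->ijd", weights, updated)
    # An agent knows its own parameter exactly, so it is never averaged.
    mixed[own, own] = updated[own, own]
    return mixed

def mean_belief_error(beliefs):
    """Average distance between true parameters and others' estimates."""
    n = beliefs.shape[0]
    own = np.arange(n)
    truth = beliefs[own, own]  # shape (n, d)
    errors = np.linalg.norm(beliefs - truth[None, :, :], axis=2)
    return errors.sum() / (n * (n - 1))  # diagonal entries are zero

# Hypothetical time-varying network: the graph alternates between
# the single edge 0-1 and the single edge 1-2.
W_a = np.array([[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])
W_b = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.5, 0.5]])

rng = np.random.default_rng(0)
n_agents, dim = 3, 2
beliefs = rng.normal(size=(n_agents, n_agents, dim))  # initial disagreement
initial_error = mean_belief_error(beliefs)

for t in range(200):
    weights = W_a if t % 2 == 0 else W_b
    zero_grads = np.zeros((n_agents, dim))  # placeholder gradients
    beliefs = networked_round(beliefs, weights, zero_grads, 1.0 / np.sqrt(t + 1))

final_error = mean_belief_error(beliefs)
```

Neither graph is connected on its own, yet their union is, which is the kind of situation a time-varying network analysis has to cover. With nonzero gradients the true parameters keep moving, so the belief error would shrink only as fast as the stepsizes allow.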

Paper Structure

This paper contains 22 sections, 8 theorems, 76 equations, 3 figures, 2 algorithms.

Key Result

Lemma 1

The policy gradient of the utility function $u_i$ with respect to the parameters $\theta_i$ can be stated as follows, where $b_i: {\mathcal{S}} \rightarrow {\mathbb R}$ is a baseline function for each agent $i \in {\mathcal{N}}$ that is defined independently of the joint actions of agents $a \in {\mathcal{A}}^N$.
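For orientation, the textbook policy gradient identity with a state-dependent baseline has the shape below. This is a generic sketch and not the lemma's exact statement, which is not reproduced on this page; the discount factor $\gamma$, the action-value function $Q_i$, the policy $\pi_i$, the joint parameter vector $\theta$, and the time indexing are assumed notation.

```latex
\nabla_{\theta_i} u_i(\theta)
  = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}
      \bigl( Q_i(s_t, a_t) - b_i(s_t) \bigr)\,
      \nabla_{\theta_i} \log \pi_i(a_{i,t} \mid s_t, \theta) \right]
```

The baseline can be subtracted without biasing the gradient because it does not depend on the actions: for every state $s$, $\mathbb{E}_{a_i \sim \pi_i}\left[ b_i(s)\, \nabla_{\theta_i} \log \pi_i(a_i \mid s, \theta) \right] = b_i(s)\, \nabla_{\theta_i} \sum_{a_i} \pi_i(a_i \mid s, \theta) = 0$ (with the sum replaced by an integral for continuous actions).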

Figures (3)

  • Figure 1: (Left) Average cumulative rewards $\hat{R}_{i}$ (\ref{eq_r_info_1}) at each episode and (Right) norms of stochastic gradients $\frac{1}{N} \| \hat{\nabla}_i u_i(\cdot)\|$ over 100 replications. Star and Ind indicate networked and independent policies, respectively. Q, Adv, and TD denote the estimators for $\hat{R}_i$ defined in \ref{eq_r_info}-\ref{eq_r_end_2}, respectively. The lines show the exponential moving average with an update rate of $0.05$, and the shaded regions show the 95% confidence intervals. For each algorithm, the reported results correspond to the best-performing configuration in terms of final accumulated rewards, selected among stepsizes with initial magnitudes of orders $10^1$, $10^0$, and $10^{-1}$ that diminish at rate $1/\sqrt{t}$ over time $t>0$.
  • Figure 2: Average local belief errors $\frac{1}{N(N-1)}\sum_{i \in {\mathcal{N}}} \sum_{j \in {\mathcal{N}} \setminus \{i\}}\|\theta_{i,t}-\hat{\theta}^j_{i,t} \|$ over $100$ replications for (Left) networked gradients with different reward estimators in a time-varying star network and (Right) networked gradients with the advantage estimator in different communication network topologies.
  • Figure 3: Convergence results over 100 replications with different network topologies. (Left) Average cumulative rewards $\hat{R}_{i}$ (\ref{eq_r_info_1}) at each episode. (Right) Norms of stochastic gradients $\frac{1}{N} \| \hat{\nabla}_i u_i(\cdot)\|$.
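The smoothing and stepsize schedule mentioned in the Figure 1 caption can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the moving average is initialized at the first observation (the caption does not say how it is initialized), and the schedule is read as an initial magnitude divided by $\sqrt{t}$.

```python
import numpy as np

def exponential_moving_average(values, rate=0.05):
    """Smooth a 1-D series; `rate` is the weight given to each new value."""
    values = np.asarray(values, dtype=float)
    smoothed = np.empty(len(values))
    smoothed[0] = values[0]  # assumed initialization
    for k in range(1, len(values)):
        smoothed[k] = (1.0 - rate) * smoothed[k - 1] + rate * values[k]
    return smoothed

def diminishing_stepsize(initial, t):
    """Stepsize at episode t >= 1, decaying proportionally to 1/sqrt(t)."""
    return initial / np.sqrt(t)

# Candidate initial magnitudes named in the caption: 10^1, 10^0, 10^-1.
candidates = [10.0, 1.0, 0.1]
schedules = {c: [diminishing_stepsize(c, t) for t in range(1, 5)] for c in candidates}

constant_series = exponential_moving_average(np.full(10, 3.0))
step_series = exponential_moving_average([0.0, 1.0])
```

A small update rate such as $0.05$ weights past episodes heavily, so the plotted curves lag behind abrupt changes in the raw per-episode values.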

Theorems & Definitions (11)

  • Definition 1: Markov Potential Games
  • Lemma 1
  • Remark 1
  • Remark 2
  • Theorem 1
  • Theorem 2
  • Lemma 2: Lipschitz and Bounded Policy Gradients
  • Lemma 3
  • Lemma 4: Consensus of Beliefs
  • Lemma 5
  • ...and 1 more