Distributed scalable coupled policy algorithm for networked multi-agent reinforcement learning

Pengcheng Dai, Dongming Wang, Wenwu Yu, Wei Ren

TL;DR

This work tackles distributed, scalable learning in networked multi-agent RL where rewards and policies are interdependent across neighborhoods. It introduces the neighbors' averaged Q-function and the DSCP algorithm, which uses geometric 2-horizon sampling to obtain unbiased gradient estimates without storing full Q-tables, and push-sum consensus to coordinate policy parameters across the network. Theoretical guarantees cover unbiased gradient estimation, convergence of local policy parameters, and eventual convergence of the joint policy to a first-order stationary point, with convergence rates influenced by the coupling radius κ_p. Empirical results on robot path planning show that DSCP outperforms state-of-the-art baselines and support the benefit of incorporating coupled policies in large-scale NMARL systems.
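As a rough illustration of why a random geometric horizon can stand in for a stored Q-table, the sketch below estimates a discounted return from a single truncated rollout. It assumes the standard random-horizon construction, in which the horizon $T$ satisfies $P(T \geq t) = \gamma^{t/2}$ and rewards are weighted by $\gamma^{t/2}$, so that the expectation equals $\sum_{t} \gamma^{t} r_{t}$. The paper's exact 2-horizon scheme may differ, and the names `sample_horizon`, `q_estimate`, and the constant-reward function are hypothetical.

```python
import math
import random


def sample_horizon(gamma, rng):
    """Draw T in {0, 1, ...} with P(T >= t) = gamma ** (t / 2)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
    return int(math.log(u) / math.log(math.sqrt(gamma)))


def q_estimate(reward_at, gamma, rng):
    """Single-rollout estimate of sum_t gamma**t * reward_at(t).

    Rewards are weighted by gamma ** (t / 2); the random truncation
    supplies the other gamma ** (t / 2) factor in expectation.
    """
    horizon = sample_horizon(gamma, rng)
    return sum(gamma ** (t / 2) * reward_at(t) for t in range(horizon + 1))


# Toy check (assumption: constant reward 1, so the true value is 1 / (1 - gamma)).
rng = random.Random(0)
gamma = 0.9
samples = [q_estimate(lambda t: 1.0, gamma, rng) for _ in range(20000)]
mean_estimate = sum(samples) / len(samples)
true_q = 1.0 / (1.0 - gamma)
```

Each estimate only needs a finite rollout whose length is random, which is what removes the need to tabulate Q-values over all state-action pairs.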

Abstract

This paper studies networked multi-agent reinforcement learning (NMARL) with interdependent rewards and coupled policies. In this setting, each agent's reward depends on its own state-action pair as well as those of its direct neighbors, and each agent's policy is parameterized by its local parameters together with those of its $\kappa_{p}$-hop neighbors, with $\kappa_{p}\geq 1$ denoting the coupled radius. The objective of the agents is to collaboratively optimize their policies to maximize the discounted average cumulative reward. To address the challenge of interdependent policies in collaborative optimization, we introduce a novel concept termed the neighbors' averaged $Q$-function and derive a new expression for the coupled policy gradient. Based on these theoretical foundations, we develop a distributed scalable coupled policy (DSCP) algorithm, where each agent relies only on the state-action pairs of its $\kappa_{p}$-hop neighbors and the rewards of its $(\kappa_{p}+1)$-hop neighbors. Specifically, in the DSCP algorithm, we employ a geometric 2-horizon sampling method that does not require storing a full $Q$-table to obtain an unbiased estimate of the coupled policy gradient. Moreover, each agent interacts exclusively with its direct neighbors to obtain accurate policy parameters, while maintaining local estimates of other agents' parameters to execute its local policy and collect samples for optimization. These estimates and policy parameters are updated via a push-sum protocol, enabling distributed coordination of policy updates across the network. We prove that the joint policy produced by the proposed algorithm converges to a first-order stationary point of the objective function. Finally, the effectiveness of the DSCP algorithm is demonstrated through simulations in a robot path planning environment, showing clear improvement over state-of-the-art methods.
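To make the push-sum step more concrete, here is a minimal sketch of generic push-sum averaging on a directed graph, where each node only sends to its direct out-neighbors. It is not the paper's parameter update: the function name `push_sum`, the equal-split weights, and the four-node graph are assumptions chosen for illustration. Each node keeps a value and a weight, splits both equally among itself and its out-neighbors, and reads off the ratio, which approaches the network-wide average when the graph is strongly connected.

```python
def push_sum(values, out_neighbors, num_iters=200):
    """Generic push-sum averaging (illustrative, not the DSCP update).

    values: initial scalar held by each node.
    out_neighbors: dict mapping node index -> list of out-neighbors
                   (assumed not to contain the node itself).
    Returns each node's ratio estimate x_i / w_i after num_iters rounds.
    """
    n = len(values)
    x = [float(v) for v in values]  # running values
    w = [1.0] * n                   # running weights
    for _ in range(num_iters):
        new_x = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            # Node i keeps one share and pushes one share to each out-neighbor.
            targets = [i] + list(out_neighbors[i])
            share = 1.0 / len(targets)
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return [xi / wi for xi, wi in zip(x, w)]


# Hypothetical strongly connected directed graph on four nodes.
out_neighbors = {0: [1], 1: [2], 2: [3, 0], 3: [0]}
estimates = push_sum([1.0, 2.0, 3.0, 6.0], out_neighbors)
```

The weight variable corrects for the fact that the mixing is only column-stochastic on a directed graph, so the raw values alone would not settle at the plain average.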

Paper Structure

This paper contains 32 sections, 10 theorems, 65 equations, 4 figures, 1 algorithm.

Key Result

Lemma 1

In the NMARL problem, for any joint policy $\bm{\pi_{\theta}}$, the global $Q$-function $Q^{\bm{\pi_{\theta}}}(\bm{s},\bm{a})$ can be decomposed as

Figures (4)

  • Figure 1: The left panel depicts the path structure with 13 locations, while the right panel shows the underlying network among 10 agents.
  • Figure 2: The evolution of $J(\bm{\theta}_{t})$ produced by Algorithm 1, the SAC algorithm, and the DPG algorithm.
  • Figure 3: The evolution of the policy parameter estimation error and the norm of the policy gradient produced by Algorithm 1.
  • Figure 4: The performance of Algorithm 1 on the objective function for different values of $\kappa_{p}$.

Theorems & Definitions (22)

  • Remark 1
  • Lemma 1
  • Theorem 1
  • Remark 2
  • Theorem 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Lemma 2
  • Lemma 3
  • ...and 12 more