Continuous-Time Distributed Dynamic Programming for Networked Multi-Agent Markov Decision Processes

Donghwan Lee, Han-Dong Lim, Do Wan Kim

TL;DR

The paper addresses planning in networked, cooperative multi-agent MDPs by developing continuous-time distributed dynamic programming (DP) algorithms that use only local rewards and communication with neighboring agents. It introduces two schemes: a Wang–Elia-inspired version with a consensus-coupled auxiliary variable, and a decoupled version that separates value-function estimation from parameter mixing. Using ODE and Lyapunov analyses, the authors characterize the equilibrium points of both schemes and prove their global asymptotic stability. These results are intended as groundwork for distributed temporal-difference learning and multi-agent Q-learning, where agents coordinate over a network without sharing their rewards.

Abstract

The main goal of this paper is to investigate continuous-time distributed dynamic programming (DP) algorithms for networked multi-agent Markov decision problems (MAMDPs). In our study, we adopt a distributed multi-agent framework where individual agents have access only to their own rewards, lacking insight into the rewards of other agents. Moreover, each agent can share its parameters with neighboring agents through a communication network, represented by a graph. We first introduce a novel distributed DP, inspired by the distributed optimization method of Wang and Elia. Next, a new distributed DP is introduced through a decoupling process. The convergence of the DP algorithms is proved through systems and control perspectives. The study in this paper sets the stage for new distributed temporal-difference learning algorithms.
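
The Wang and Elia method mentioned above couples each agent's estimate with an auxiliary integrator state through the graph Laplacian. The sketch below illustrates only that consensus mechanism on a toy averaging problem; it is not the paper's DP algorithm, whose equations are not reproduced in this summary. The quadratic local costs, the ring topology, the step size, and the name `simulate_wang_elia` are assumptions made for illustration.

```python
# Toy illustration (assumed setup, not the paper's algorithm):
# Wang-Elia-style dynamics, integrated with forward Euler:
#   dx/dt = -L x - L z - grad f(x),   dz/dt = L x
# Each agent i holds a private cost f_i(x) = 0.5 * (x - rewards[i])**2
# and only exchanges x and z with its graph neighbors.

def simulate_wang_elia(rewards, neighbors, dt=0.01, steps=4000):
    n = len(rewards)
    x = [0.0] * n  # local estimates
    z = [0.0] * n  # auxiliary (integrator) states
    for _ in range(steps):
        # Laplacian products use neighbor values only: (L v)_i = sum_j (v_i - v_j)
        lx = [sum(x[i] - x[j] for j in neighbors[i]) for i in range(n)]
        lz = [sum(z[i] - z[j] for j in neighbors[i]) for i in range(n)]
        grad = [x[i] - rewards[i] for i in range(n)]  # gradient of local cost
        x = [x[i] + dt * (-lx[i] - lz[i] - grad[i]) for i in range(n)]
        z = [z[i] + dt * lx[i] for i in range(n)]
    return x

# Hypothetical five-agent ring; the real topology may differ.
rewards = [1.0, 2.0, 3.0, 4.0, 5.0]
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
estimates = simulate_wang_elia(rewards, ring)
```

In this toy setting every agent's estimate approaches the average of the private `rewards`, even though no agent ever sees another agent's reward directly. The integrator state `z` is what removes the steady-state disagreement that a plain consensus term would leave.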

Paper Structure (16 sections, 5 theorems, 45 equations, 7 figures, 2 algorithms)

Key Result

Proposition 1

$\bar{\theta}^*$ is the unique asymptotically stable equilibrium point of the linear system in Eq. (1), i.e., $\bar{\theta}_t \to \bar{\theta}^*$ as $t \to \infty$.
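
Proposition 1 is a statement about a linear ODE converging to its unique equilibrium. Eq. (1) itself is not reproduced in this summary, so the sketch below uses an assumed stand-in: the continuous-time policy-evaluation system $\dot{\theta}_t = r + \gamma P \theta_t - \theta_t$, whose equilibrium satisfies the Bellman equation $\theta^* = r + \gamma P \theta^*$. The two-state chain, the rewards, the discount factor, and the name `simulate_linear_dp` are hypothetical.

```python
# Assumed stand-in for a linear DP system (not the paper's Eq. (1)):
#   d(theta)/dt = r + gamma * P @ theta - theta
# integrated with forward Euler in plain Python.

def simulate_linear_dp(P, r, gamma, dt=0.01, steps=6000):
    n = len(r)
    theta = [0.0] * n
    for _ in range(steps):
        # Drift = Bellman backup minus the current estimate
        drift = [
            r[i] + gamma * sum(P[i][j] * theta[j] for j in range(n)) - theta[i]
            for i in range(n)
        ]
        theta = [theta[i] + dt * drift[i] for i in range(n)]
    return theta

# Hypothetical two-state Markov chain under a fixed policy.
P = [[0.9, 0.1], [0.2, 0.8]]  # row-stochastic transition matrix
r = [1.0, 0.0]                # expected rewards per state
gamma = 0.5                   # discount factor

theta = simulate_linear_dp(P, r, gamma)

# Bellman residual at the final iterate; near zero at the equilibrium.
residual = max(
    abs(r[i] + gamma * sum(P[i][j] * theta[j] for j in range(2)) - theta[i])
    for i in range(2)
)
```

Because $P$ is row-stochastic and $\gamma < 1$, every eigenvalue of $\gamma P - I$ has a negative real part, which is why the trajectory in this stand-in system settles at a single point regardless of the initial condition.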

Figures (7)

  • Figure 1: Network topology of five RL agents.
  • Figure 2: Algorithm 1: Evolution of the first entries of $\theta_t^1$, $\theta_t^2$, $\theta_t^3$, $\theta_t^4$, and $\theta_t^5$.
  • Figure 3: Algorithm 1: Evolution of the second entries of $\theta_t^1$, $\theta_t^2$, $\theta_t^3$, $\theta_t^4$, and $\theta_t^5$.
  • Figure 4: Algorithm 1: Evolution of $\left\| \bar{\theta}_t - \bar{\theta}^* \right\|_2$.
  • Figure 5: Algorithm 2: Evolution of the first entries of $w_t^1$, $w_t^2$, $w_t^3$, $w_t^4$, and $w_t^5$.
  • ...and 2 more figures

Theorems & Definitions (7)

  • Proposition 1
  • Proof
  • Proposition 2: Equilibrium points
  • Proposition 3: Global asymptotic stability
  • Proposition 4: Equilibrium points
  • Proposition 5: Global asymptotic stability
  • Example 1