
A primal-dual perspective for distributed TD-learning

Han-Dong Lim, Donghwan Lee

TL;DR

The paper addresses distributed policy evaluation in networked multi-agent MDPs by recasting learning updates as primal–dual gradient dynamics with null-space constraints. It develops a distributed TD-learning algorithm that operates without a doubly stochastic communication matrix and provides finite-time, exponential convergence guarantees under both i.i.d. and Markov observation models, for constant and diminishing step-sizes. The key contributions are (i) a primal–dual ODE analysis with sharp rates that account for graph spectra and (ii) finite-time bounds for distributed TD-learning in both observation models, validated by experiments across various network topologies. This work advances scalable, robust multi-agent reinforcement learning by enabling distributed evaluation on more general and uncertain networks.

Abstract

The goal of this paper is to investigate distributed temporal difference (TD) learning for a networked multi-agent Markov decision process. The proposed approach is based on distributed optimization algorithms, which can be interpreted as primal-dual ordinary differential equation (ODE) dynamics subject to null-space constraints. Based on the exponential convergence of these dynamics, we examine the behavior of the final iterate in various distributed TD-learning scenarios, considering both constant and diminishing step-sizes and both i.i.d. and Markovian observation models. Unlike existing methods, the proposed algorithm does not require the underlying communication network to be characterized by a doubly stochastic matrix.
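
To make the phrase "primal-dual ODE dynamics subject to null-space constraints" concrete, the following is a minimal sketch on a toy consensus problem, not the paper's TD-learning algorithm. It assumes an undirected ring graph, scalar quadratic local costs $f_i(x) = \frac{1}{2}(x - a_i)^2$, and a forward-Euler discretization; the names `ring_laplacian` and `primal_dual_consensus` are hypothetical. The constraint $Lx = 0$, with $L$ the graph Laplacian, forces all agents to agree, and the dual variable `w` enforces it without any doubly stochastic mixing matrix.

```python
import numpy as np

def ring_laplacian(n):
    """Graph Laplacian of an undirected ring with n nodes (assumes n >= 3)."""
    L = 2.0 * np.eye(n)
    for i in range(n):
        L[i, (i + 1) % n] -= 1.0
        L[i, (i - 1) % n] -= 1.0
    return L

def primal_dual_consensus(a, L, step=0.05, iters=5000):
    """Euler-discretized primal-dual dynamics for
        minimize sum_i 0.5 * (x_i - a_i)^2  subject to  L x = 0.
    Illustrative assumption: this toy objective stands in for the
    TD-learning objective studied in the paper.
    """
    a = np.asarray(a, dtype=float)
    x = np.zeros_like(a)  # primal variables, one per agent
    w = np.zeros_like(a)  # dual variables, one per agent
    for _ in range(iters):
        x_dot = -(x - a) - L @ x - L @ w  # local gradient + consensus + dual terms
        w_dot = L @ x                     # dual ascent on the constraint L x = 0
        x = x + step * x_dot
        w = w + step * w_dot
    return x, w

a = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
x, w = primal_dual_consensus(a, ring_laplacian(5))
print(x)  # every entry should be close to a.mean()
```

Each agent only multiplies by its own row of `L`, so the update uses neighbor information alone. At equilibrium `L @ x = 0` puts `x` in the null space of the Laplacian (all entries equal), and summing the primal equation over agents shows the common value is the average of `a`.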

Paper Structure

This paper contains 24 sections, 23 theorems, 121 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Lemma 2

Let ${\bm{S}} := \ldots$, where $\beta := \max\left\{ \frac{2\lambda_{\max}({\bm{M}})^2 + 2 + \left\| {\bm{U}} \right\|_2^2}{\lambda_{\min}({\bm{U}}+{\bm{U}}^{\top})}, 4\lambda_{\max}({\bm{M}}) \right\}$. Then $\frac{\beta}{2} {\bm{I}}_{2n} \prec {\bm{S}} \prec 2\beta {\bm{I}}_{2n}$, and we have, for any …

Figures (4)

  • Figure 1: Experimental results for Algorithm 1, averaged over 50 runs.
  • Figure 2: The doubly stochastic matrix was constructed by solving a least-squares problem (bai2007computing). The result of wang2020decentralized is not plotted because it diverges. The step-size was chosen as $1/2^3$.
  • Figure 3: The doubly stochastic matrix was constructed by the Sinkhorn-Knopp algorithm (knight2008sinkhorn). The step-size was chosen as $1/2^3$.
  • Figure 4: Full plots for the result in Figure \ref{fig:step-size}.
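
The captions above refer to building a doubly stochastic matrix for the baseline methods. As a rough illustration of what that involves, here is a minimal sketch of the Sinkhorn-Knopp iteration, which alternately normalizes rows and columns of a strictly positive matrix. The function names, tolerances, and the random test matrix are assumptions for illustration, not the experimental setup of the paper.

```python
import numpy as np

def sinkhorn_knopp(A, iters=1000, tol=1e-10):
    """Scale a strictly positive square matrix toward a doubly stochastic one
    by alternating row and column normalization."""
    P = np.array(A, dtype=float)
    for _ in range(iters):
        P /= P.sum(axis=1, keepdims=True)  # make every row sum to 1
        P /= P.sum(axis=0, keepdims=True)  # make every column sum to 1
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break  # rows still sum to 1 after the column step: converged
    return P

def is_doubly_stochastic(P, tol=1e-8):
    """Check nonnegativity and unit row and column sums."""
    P = np.asarray(P, dtype=float)
    return bool(
        (P >= 0).all()
        and np.allclose(P.sum(axis=0), 1.0, atol=tol)
        and np.allclose(P.sum(axis=1), 1.0, atol=tol)
    )

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(4, 4))  # hypothetical positive weight matrix
P = sinkhorn_knopp(A)
print(is_doubly_stochastic(P))
```

This preprocessing uses the whole matrix at once, which hints at why an algorithm that avoids the doubly stochastic requirement is convenient in a distributed setting.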

Theorems & Definitions (41)

  • Lemma 2
  • Theorem 3
  • Lemma 4
  • Theorem 5
  • Theorem 6
  • Definition 7: Doubly stochastic matrix (doan2019finite)
  • Lemma 8: pavlikova2023moore, p. 2
  • Lemma 9: Schur complement and symmetric positive definite matrices (Theorem 1.12 in horn2005basic)
  • Lemma 10: Proposition 4.5 in levin2017markov
  • Lemma 11
  • ...and 31 more