A primal-dual perspective for distributed TD-learning
Han-Dong Lim, Donghwan Lee
TL;DR
The paper addresses distributed policy evaluation in networked multi-agent MDPs by recasting learning updates as primal–dual gradient dynamics with null-space constraints. It develops a distributed TD-learning algorithm that operates without a doubly stochastic communication matrix and provides finite-time, exponential convergence guarantees under both i.i.d. and Markov observation models, for constant and diminishing step-sizes. The key contributions are (i) a primal–dual ODE analysis with sharp rates that account for graph spectra and (ii) finite-time bounds for distributed TD-learning in both observation models, validated by experiments across various network topologies. The result extends distributed policy evaluation to communication networks that need not be described by a doubly stochastic matrix.
Abstract
The goal of this paper is to investigate distributed temporal difference (TD) learning for a networked multi-agent Markov decision process. The proposed approach is based on distributed optimization algorithms, which can be interpreted as primal-dual ordinary differential equation (ODE) dynamics subject to null-space constraints. Based on the exponential convergence behavior of these dynamics, we examine the behavior of the final iterate in various distributed TD-learning scenarios, considering both constant and diminishing step-sizes and incorporating both i.i.d. and Markovian observation models. Unlike existing methods, the proposed algorithm does not require the assumption that the underlying communication network structure is characterized by a doubly stochastic matrix.
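As a rough illustration of what primal-dual dynamics under a null-space constraint look like, the sketch below runs an Euler discretization of x' = -grad f(x) - L v, v' = L x on a toy consensus problem, where L is a graph Laplacian and the constraint L x = 0 forces all agents to agree. This is not the paper's TD-learning algorithm: the quadratic local objectives, the path graph, the local targets b, the step size, and the iteration count are all assumptions chosen for illustration.

```python
# Toy illustration only (not the paper's algorithm).
# Euler-discretized primal-dual gradient dynamics
#   x' = -grad f(x) - L v,   v' = L x
# for: minimize sum_i 0.5 * (x_i - b_i)^2  subject to  L x = 0,
# where L is the Laplacian of a connected undirected graph, so the
# null space of L is the consensus subspace {x : x_1 = ... = x_n}.


def laplacian_path(n):
    """Laplacian of an undirected path graph with n nodes (assumed topology)."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L


def matvec(M, x):
    """Dense matrix-vector product using plain lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def primal_dual_consensus(b, step=0.05, iters=5000):
    """Run the discretized dynamics; returns the final primal iterate x."""
    n = len(b)
    L = laplacian_path(n)
    x = [0.0] * n  # primal variables, one per agent
    v = [0.0] * n  # dual variables enforcing L x = 0
    for _ in range(iters):
        Lx = matvec(L, x)
        Lv = matvec(L, v)
        # Primal step: local gradient (x_i - b_i) plus the dual coupling term.
        x_new = [x[i] - step * ((x[i] - b[i]) + Lv[i]) for i in range(n)]
        # Dual step: ascent on the constraint violation L x (uses the old x).
        v = [v[i] + step * Lx[i] for i in range(n)]
        x = x_new
    return x


b = [1.0, 2.0, 3.0, 6.0]  # hypothetical local targets; their mean is 3.0
x_final = primal_dual_consensus(b)
print(x_final)
```

In this toy problem every agent's value should approach the average of the local targets, since that is the minimizer on the consensus subspace. Each update uses only a node's own state and its neighbors' states through L, and no doubly stochastic mixing matrix appears in the iteration.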
