Sample and Communication Efficient Fully Decentralized MARL Policy Evaluation via a New Approach: Local TD update

Fnu Hairi, Zifan Zhang, Jia Liu

TL;DR

This work tackles policy evaluation in fully decentralized cooperative MARL under the average-reward objective. It introduces a local TD-update scheme that performs many TD steps between communications, and provides finite-time analysis showing reduced communication complexity while maintaining competitive sample complexity. The authors prove that up to $K=O(1/\sqrt{\epsilon}\log(1/\epsilon))$ local updates per round suffice to reach an $\epsilon$-approximate solution, with overall sample complexity $KL=O(1/\epsilon\log^2(1/\epsilon))$ and communication complexity $L=O(1/\sqrt{\epsilon}\log(1/\epsilon))$. Empirical results on synthetic data and a cooperative navigation task validate the theory, demonstrating faster communication-constrained convergence and favorable comparisons to vanilla and batching approaches in decentralized MARL-PE.
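The sample-complexity figure quoted above is the product of the other two bounds. A short worked check, assuming each local TD step consumes one sample per agent so that the total number of samples scales as $KL$:

```latex
\begin{align*}
K &= O\!\left(\frac{1}{\sqrt{\epsilon}}\log\frac{1}{\epsilon}\right), \qquad
L = O\!\left(\frac{1}{\sqrt{\epsilon}}\log\frac{1}{\epsilon}\right), \\
KL &= O\!\left(\frac{1}{\sqrt{\epsilon}}\log\frac{1}{\epsilon}\cdot\frac{1}{\sqrt{\epsilon}}\log\frac{1}{\epsilon}\right)
    = O\!\left(\frac{1}{\epsilon}\log^{2}\frac{1}{\epsilon}\right).
\end{align*}
```

Read this way, local updates cut the number of communication rounds $L$ by roughly a factor of $1/\sqrt{\epsilon}$ relative to the number of samples, at the cost of $K$ local steps per round.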

Abstract

In the actor-critic framework for fully decentralized multi-agent reinforcement learning (MARL), one of the key components is the MARL policy evaluation (PE) problem, where a set of $N$ agents work cooperatively to evaluate the value function of the global states for a given policy by communicating with their neighbors. In MARL-PE, a critical challenge is how to lower the sample and communication complexities, defined as the number of training samples and communication rounds needed to converge to an $\epsilon$-stationary point. To lower communication complexity in MARL-PE, a "natural" idea is to perform multiple local TD-update steps between consecutive rounds of communication, which reduces the communication frequency. However, the validity of the local TD-update approach remains unclear because of the potential "agent-drift" phenomenon, which results from rewards that are in general heterogeneous across agents. This leads to an interesting open question: Can the local TD-update approach achieve low sample and communication complexities? In this paper, we make the first attempt to answer this fundamental question. We focus on the setting of MARL-PE with average reward, which is motivated by many multi-agent network optimization problems. Our theoretical and experimental results confirm that allowing multiple local TD-update steps is indeed an effective approach to lowering the sample and communication complexities of MARL-PE compared to consensus-based MARL-PE algorithms. Specifically, the number of local TD-update steps between two consecutive communication rounds can be as large as $\mathcal{O}(1/\epsilon^{1/2}\log{(1/\epsilon)})$ while still converging to an $\epsilon$-stationary point of MARL-PE. Moreover, we show theoretically that, in order to reach the optimal sample complexity, the communication complexity of the local TD-update approach is $\mathcal{O}(1/\epsilon^{1/2}\log{(1/\epsilon)})$.
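The idea of taking several local TD steps between communications can be sketched in a few lines. The following is a minimal illustration and not the paper's algorithm: it assumes linear value approximation, average-reward TD(0), a tabular transition matrix, and a doubly stochastic mixing matrix `A`, and it mixes only the value weights. The function name `local_td_round` and all of its parameters are hypothetical.

```python
import numpy as np


def local_td_round(W, mu, A, phi, P, R, s, K, beta, rng):
    """One communication round: K local TD(0) steps per agent, then one mixing step.

    W   : (N, d) per-agent value weights        mu : (N,) per-agent average-reward estimates
    A   : (N, N) mixing matrix (assumed doubly stochastic)
    phi : (S, d) state features                 P  : (S, S) transition matrix under the policy
    R   : (N, S) local rewards (heterogeneous across agents)
    s   : current global state index            K  : number of local steps, beta : step size
    """
    W, mu = W.copy(), mu.copy()
    for _ in range(K):
        # All agents observe the same global state transition.
        s_next = rng.choice(P.shape[0], p=P[s])
        r = R[:, s]  # each agent sees only its own reward
        # Average-reward TD error, one entry per agent.
        delta = r - mu + W @ (phi[s_next] - phi[s])
        # Local updates: no communication happens inside this loop.
        W += beta * np.outer(delta, phi[s])
        mu += beta * (r - mu)
        s = s_next
    # Single communication step: each agent averages weights with its neighbors.
    W = A @ W
    return W, mu, s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, S, d = 4, 5, 3
    phi = rng.standard_normal((S, d))
    P = np.full((S, S), 1.0 / S)        # uniform transitions, for illustration only
    R = rng.uniform(0.0, 1.0, (N, S))   # heterogeneous local rewards
    A = np.full((N, N), 1.0 / N)        # complete-graph averaging
    W, mu, s = np.zeros((N, d)), np.zeros(N), 0
    for _ in range(20):                 # L = 20 communication rounds
        W, mu, s = local_td_round(W, mu, A, phi, P, R, s, K=10, beta=0.05, rng=rng)
    print("max disagreement after mixing:", np.abs(W - W.mean(axis=0)).max())
```

With heterogeneous rows in `R`, the agents' weights diverge during the `K` local steps, which is the "agent-drift" effect the abstract refers to; the mixing step pulls them back together. On a sparse communication graph, `A` would only partially average the weights, so some disagreement would persist between rounds.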

Paper Structure (27 sections, 4 theorems, 53 equations, 20 figures, 1 table, 2 algorithms)

Key Result

Lemma 1

Suppose that Assumptions ass:r_bou through ass:fea hold. For the consensus error generated by Algorithm alg:local_average, if $\beta K\le\min\left\{\frac{1}{2},\frac{\eta^{N-1}}{4(1-\eta^{N-1})}\right\}$, it then holds that [...], where $\kappa_1=\frac{2N^{2}(1+\eta^{-(N-1)})}{1-\eta^{N-1}}$, $\kappa_2=8(1+\eta^{-(N-1)})N^{\frac{5}{2}}r_{\max}$, and $\rho:=(1+4\beta K)(1-\eta^{N-1})$. By the condition on $\beta K$, we ...
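One consequence of the stated condition on $\beta K$ can be checked directly from the definition of $\rho$. The derivation below is a worked aside, not text recovered from the paper, and assumes $0<\eta<1$ and $N\ge 2$ so that $0<1-\eta^{N-1}<1$:

```latex
\begin{align*}
\beta K \le \frac{\eta^{N-1}}{4\,(1-\eta^{N-1})}
  &\;\Longrightarrow\; 4\beta K\,(1-\eta^{N-1}) \le \eta^{N-1}, \\
\rho = (1+4\beta K)(1-\eta^{N-1})
  &= (1-\eta^{N-1}) + 4\beta K\,(1-\eta^{N-1}) \\
  &\le (1-\eta^{N-1}) + \eta^{N-1} = 1 .
\end{align*}
```

The bound is strict, $\rho<1$, whenever the inequality on $\beta K$ is strict, so $\rho$ can serve as a contraction factor for the consensus error.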

Figures (20)

  • Figure 1: Convergence with respect to the number of communication rounds and samples.
  • Figure 2: Convergence comparisons with different settings of $(K,L)$ and the impact of local TD-update steps $K$ on convergence performance.
  • Figure 3: A cooperative navigation task.
  • Figure 4: Convergence in terms of the number of communication rounds and training samples.
  • Figure 5: Network Topology
  • ...and 15 more figures

Theorems & Definitions (5)

  • Definition 1: Networked Multi-Agent MDP
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Theorem 2