Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis

Guangchen Lan; Dong-Jun Han; Abolfazl Hashemi; Vaneet Aggarwal; Christopher G. Brinton

TL;DR

This paper analyzes the theoretical global convergence bound of AFedPG, characterizes the algorithm's advantage in both sample complexity and time complexity, and empirically verifies its improved performance in four widely used MuJoCo environments.

Abstract

To improve the efficiency of reinforcement learning (RL), we propose a novel asynchronous federated reinforcement learning (FedRL) framework termed AFedPG, which constructs a global model through collaboration among $N$ agents using policy gradient (PG) updates. To address the challenge of lagged policies in asynchronous settings, we design a delay-adaptive lookahead technique specifically for FedRL that can effectively handle heterogeneous arrival times of policy gradients. We analyze the theoretical global convergence bound of AFedPG, and characterize the advantage of the proposed algorithm in terms of both sample complexity and time complexity. Specifically, our AFedPG method achieves $O\left(\frac{\epsilon^{-2.5}}{N}\right)$ sample complexity for global convergence at each agent on average. Compared to the single-agent setting with $O(\epsilon^{-2.5})$ sample complexity, it enjoys a linear speedup with respect to the number of agents. Moreover, compared to synchronous FedPG, AFedPG improves the time complexity from $O\left(\frac{t_{\max}}{N}\right)$ to $O\left(\left(\sum_{i=1}^{N} \frac{1}{t_{i}}\right)^{-1}\right)$, where $t_{i}$ denotes the time consumption in each iteration at agent $i$, and $t_{\max}$ is the largest one. The latter complexity $O\left(\left(\sum_{i=1}^{N} \frac{1}{t_{i}}\right)^{-1}\right)$ is never larger than the former (the two coincide only when all $t_{i}$ are equal), and the improvement becomes significant in large-scale federated settings with heterogeneous computing powers ($t_{\max}\gg t_{\min}$). Finally, we empirically verify the improved performance of AFedPG in four widely used MuJoCo environments with varying numbers of agents. We also demonstrate the advantages of AFedPG in various computing heterogeneity scenarios.
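The time-complexity comparison in the abstract can be checked numerically. Because $\frac{1}{t_{i}} \geq \frac{1}{t_{\max}}$ for every agent, $\sum_{i=1}^{N} \frac{1}{t_{i}} \geq \frac{N}{t_{\max}}$, so $\left(\sum_{i=1}^{N} \frac{1}{t_{i}}\right)^{-1} \leq \frac{t_{\max}}{N}$. The sketch below is only an illustration of these two expressions, not code from the paper; the function names and the example per-iteration times are assumptions chosen for the demonstration.

```python
def sync_time_per_update(times):
    """t_max / N: a synchronous round waits for the slowest agent
    and yields one gradient from each of the N agents."""
    return max(times) / len(times)


def async_time_per_update(times):
    """(sum_i 1/t_i)^{-1}: agent i delivers gradients at rate 1/t_i,
    so the aggregate arrival rate is the sum of the individual rates."""
    return 1.0 / sum(1.0 / t for t in times)


# Hypothetical per-iteration times t_i (arbitrary time units).
homogeneous = [1.0, 1.0, 1.0, 1.0]
heterogeneous = [1.0, 1.0, 1.0, 10.0]  # one straggler, t_max >> t_min

for name, times in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    print(name, sync_time_per_update(times), async_time_per_update(times))
```

With equal speeds the two quantities coincide at 0.25; with a single straggler the synchronous value grows to 2.5 while the asynchronous value stays near 0.32, which matches the abstract's remark that the gap widens when $t_{\max}\gg t_{\min}$.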

Paper Structure (23 sections, 9 theorems, 46 equations, 8 figures, 3 tables, 2 algorithms)

This paper contains 23 sections, 9 theorems, 46 equations, 8 figures, 3 tables, 2 algorithms.

Introduction
Summary of Contributions
Related Work
Problem Setup
Proposed Asynchronous FedPG
Convergence Analysis
Experiments
Setup
Results
Discussions
Supplementary Results
Comparison to the Synchronous Setting
Supplementary Experimental Settings
Supplementary Experiments
Theoretical Proofs
...and 8 more sections

Key Result

Theorem 5.1

(Global) Let Assumptions assum:policy and assum:func_approx hold. With suitable learning rates $\eta_{k}$ and $\alpha_{k}$, after $K$ global iterations, AFedPG satisfies a global convergence bound in which $\epsilon_{{\rm bias}}$ is the term defined in equation eq:approx_error. Thus, to satisfy $J^{\star} - J(\theta_{K}) \leq \epsilon + \frac{\sqrt{\epsilon_{{\rm bias}}}}{1-\gamma}$, we need $K = \mathcal{O}(\frac{\epsilon^{-2.5}}{(1-\gamma)

Figures (8)

Figure 1: An illustration of the asynchronous federated policy gradient updates. Each agent has a local copy of the environment, and agents may collect data according to different local policies. At each iteration, the agent in the yellow color finishes the local process and then communicates with the server, while the other agents keep sampling and computing local gradients in parallel. In the $k$-th global iteration, $\delta_{k} \in\mathbb{N}$ is the delay, $\widetilde{\tau}_{k-\delta_{k} }$ is the sample collected according to the policy $\pi_{\widetilde{\theta}_{k-\delta_{k} }}$, and $d_{k-\delta_{k}}$ is the updating direction calculated from the sample $\widetilde{\tau}_{k-\delta_{k}}$.
Figure 2: Visualization of the four MuJoCo tasks considered in this paper for experiments.
Figure 3: Reward performance of AFedPG ($N=2,4,8$) and PG ($N=1$) on various MuJoCo environments, where $N$ is the number of federated agents. The solid lines are results averaged over $10$ runs with random seeds from $0$ to $9$. The shaded areas are $95\%$ confidence intervals.
Figure 4: Global time of AFedPG and FedPG for given numbers of collected samples on various MuJoCo environments, where $N$ is the number of federated agents. The solid lines are results averaged over $10$ runs. The shaded areas are $95\%$ confidence intervals.
Figure 5: Comparison of time consumption between the synchronous and asynchronous approaches. The circled numbers denote the indices of global steps.
...and 3 more figures
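
The delay $\delta_{k}$ described in the Figure 1 caption can be illustrated with a small event-driven simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes each agent $i$ needs a fixed time `times[i]` per local iteration and fetches the newest global model immediately after reporting to the server. It tracks only which agent arrives at each global iteration and how stale its model is; the delay-adaptive lookahead update itself is not modeled. The name `simulate_delays` and the example times are hypothetical.

```python
import heapq


def simulate_delays(times, num_global_steps):
    """Return (k, agent, delay) for each global iteration k.

    delay = k - (index of the global model the agent computed its
    gradient with), mirroring delta_k in the Figure 1 caption.
    """
    # Heap entries: (finish_time, agent_index, model_version_used).
    events = [(t, i, 0) for i, t in enumerate(times)]
    heapq.heapify(events)
    log = []
    for k in range(num_global_steps):
        # The agent that finishes first triggers global iteration k.
        finish, agent, version = heapq.heappop(events)
        log.append((k, agent, k - version))
        # The server updates the model to version k + 1; the agent
        # fetches it and starts its next local iteration.
        heapq.heappush(events, (finish + times[agent], agent, k + 1))
    return log


# Hypothetical setting: agent 0 is fast, agent 1 is 2.5x slower.
for k, agent, delay in simulate_delays([1.0, 2.5], 5):
    print(f"global step {k}: agent {agent}, delay {delay}")
```

In this toy run the fast agent mostly reports with delay 0, while the slow agent's first gradient arrives at global step 2 with delay 2, since two global updates were applied while it was still computing.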

Theorems & Definitions (14)

Definition 4.1
Definition 4.2
Theorem 5.1
Theorem 5.2
Lemma B.4
Lemma B.5
proof
Lemma B.6
Lemma B.7
Lemma B.8
...and 4 more
