Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning

Tong Yang, Shicong Cen, Yuting Wei, Yuxin Chen, Yuejie Chi

TL;DR

This work considers a multi-task setting in which each agent has its own private reward function, corresponding to a different task, while all agents share the same transition kernel of the environment. It establishes, to the authors' knowledge for the first time, near dimension-free global convergence for federated multi-task RL using policy optimization.

Abstract

Federated reinforcement learning (RL) enables collaborative decision making of multiple distributed agents without sharing local data trajectories. In this work, we consider a multi-task setting, in which each agent has its own private reward function corresponding to different tasks, while sharing the same transition kernel of the environment. Focusing on infinite-horizon Markov decision processes, the goal is to learn a globally optimal policy that maximizes the sum of the discounted total rewards of all the agents in a decentralized manner, where each agent only communicates with its neighbors over some prescribed graph topology. We develop federated vanilla and entropy-regularized natural policy gradient (NPG) methods in the tabular setting under softmax parameterization, where gradient tracking is applied to estimate the global Q-function to mitigate the impact of imperfect information sharing. We establish non-asymptotic global convergence guarantees under exact policy evaluation, where the rates are nearly independent of the size of the state-action space and illuminate the impacts of network size and connectivity. To the best of our knowledge, this is the first time that near dimension-free global convergence is established for federated multi-task RL using policy optimization. We further go beyond the tabular setting by proposing a federated natural actor critic (NAC) method for multi-task RL with function approximation, and establish its finite-time sample complexity taking the errors of function approximation into account.
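
The tabular method described in the abstract can be illustrated with a minimal sketch. It is not the authors' reference implementation: the update rules below (an NPG-style step on the softmax logits followed by mixing with neighbors, plus a gradient-tracking recursion on the local Q-functions) are one plausible reading of the abstract, and all names (`fed_npg`, `q_eval`, `mix`, the mixing matrix `W`, and the toy problem) are assumptions made for illustration. Policy evaluation is approximated by iterating the Bellman equation rather than solved exactly.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def q_eval(P, r, pi, gamma, iters=500):
    """Approximate Q^pi for one agent's reward r by iterating the Bellman equation."""
    S, A = len(r), len(r[0])
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(iters):
        V = [sum(pi[s][a] * Q[s][a] for a in range(A)) for s in range(S)]
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
    return Q

def mix(W, X):
    """One communication round: agent n averages X[m] with weights W[n][m]."""
    N, S, A = len(X), len(X[0]), len(X[0][0])
    return [[[sum(W[n][m] * X[m][s][a] for m in range(N)) for a in range(A)]
             for s in range(S)] for n in range(N)]

def fed_npg(P, rewards, W, gamma, eta, num_iters):
    """Hypothetical federated NPG loop with Q-function tracking (illustrative)."""
    N, S, A = len(rewards), len(rewards[0]), len(rewards[0][0])
    logits = [[[0.0] * A for _ in range(S)] for _ in range(N)]  # uniform initial policies
    pi = [[softmax(row) for row in agent] for agent in logits]
    Q = [q_eval(P, rewards[n], pi[n], gamma) for n in range(N)]
    tracker = [[row[:] for row in Qn] for Qn in Q]  # tracking starts from the local Q
    for _ in range(num_iters):
        # Local NPG-style step on the logits, then mixing with neighbors.
        stepped = [[[logits[n][s][a] + eta / (1 - gamma) * tracker[n][s][a]
                     for a in range(A)] for s in range(S)] for n in range(N)]
        logits = mix(W, stepped)
        pi = [[softmax(row) for row in agent] for agent in logits]
        # Gradient tracking: add the change in the local Q, then mix.
        Q_new = [q_eval(P, rewards[n], pi[n], gamma) for n in range(N)]
        corrected = [[[tracker[n][s][a] + Q_new[n][s][a] - Q[n][s][a]
                       for a in range(A)] for s in range(S)] for n in range(N)]
        tracker = mix(W, corrected)
        Q = Q_new
    return pi

# Toy problem (assumed): one state, two actions, two agents with conflicting rewards.
P = [[[1.0], [1.0]]]                    # P[s][a][s']: both actions loop back to state 0
rewards = [[[1.0, 0.0]], [[0.0, 0.5]]]  # rewards[n][s][a]
W = [[0.5, 0.5], [0.5, 0.5]]            # doubly stochastic mixing matrix
policies = fed_npg(P, rewards, W, gamma=0.9, eta=0.1, num_iters=20)
```

In this toy problem agent 1 alone would prefer action 1, but the summed reward favors action 0, so both agents end up concentrating their policies on action 0 even though neither ever sees the other's reward function.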

Paper Structure

This paper contains 82 sections, 34 theorems, 348 equations, 1 table, and 5 algorithms.

Key Result

Theorem 1

Suppose $\pi_n^{(0)},n\in[N]$ are set as the uniform distribution. Then for $0<\eta\leq \eta_1\coloneqq \frac{(1-\sigma)^2(1-\gamma)^3}{16\sqrt{N}\sigma}$, we have … Furthermore, the consensus error satisfies …
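
To get a feel for the step-size condition, the bound $\eta_1$ can be evaluated numerically. The helper below is a small sketch: the function name and the example values are hypothetical, and $\sigma$ is assumed to be the spectral radius from Definition 1 with $0<\sigma<1$.

```python
import math

def fednpg_step_size_bound(num_agents, sigma, gamma):
    """Evaluate eta_1 = (1-sigma)^2 (1-gamma)^3 / (16 sqrt(N) sigma).

    Assumes 0 < sigma < 1 and 0 <= gamma < 1; the function name is illustrative.
    """
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie strictly between 0 and 1")
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return (1 - sigma) ** 2 * (1 - gamma) ** 3 / (16 * math.sqrt(num_agents) * sigma)

# Example values (assumed): 4 agents, sigma = 0.5, discount factor 0.9.
eta_1 = fednpg_step_size_bound(num_agents=4, sigma=0.5, gamma=0.9)
```

The bound shrinks as the discount factor approaches 1, as the network grows, and as $\sigma$ approaches 1, so larger or more poorly connected networks call for smaller step sizes under this condition.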

Theorems & Definitions (51)

  • Definition 1: spectral radius
  • Theorem 1: Global sublinear convergence of exact FedNPG (informal)
  • Corollary 1: Iteration complexity of exact FedNPG
  • Theorem 2: Global sublinear convergence of inexact FedNPG (informal)
  • Remark 1: sample complexity bound of inexact FedNPG
  • Theorem 3: Global linear convergence of exact entropy-regularized FedNPG (informal)
  • Corollary 2: Iteration complexity of exact entropy-regularized FedNPG
  • Theorem 4: Global linear convergence of inexact entropy-regularized FedNPG (informal)
  • Theorem 5: Convergence rate of Algorithm \ref{alg:actor_critic} (informal)
  • Theorem 6
  • ...and 41 more