Federated Temporal Difference Learning with Linear Function Approximation under Environmental Heterogeneity

Han Wang, Aritra Mitra, Hamed Hassani, George J. Pappas, James Anderson

TL;DR

This work provides the first comprehensive finite-time analysis of a federated temporal difference (TD) learning algorithm with linear function approximation, while accounting for Markovian sampling, heterogeneity in the agents' environments, and multiple local updates to save communication.

Abstract

We initiate the study of federated reinforcement learning under environmental heterogeneity by considering a policy evaluation problem. Our setup involves $N$ agents interacting with environments that share the same state and action space but differ in their reward functions and state transition kernels. Assuming agents can communicate via a central server, we ask: Does exchanging information expedite the process of evaluating a common policy? To answer this question, we provide the first comprehensive finite-time analysis of a federated temporal difference (TD) learning algorithm with linear function approximation, while accounting for Markovian sampling, heterogeneity in the agents' environments, and multiple local updates to save communication. Our analysis crucially relies on several novel ingredients: (i) deriving perturbation bounds on TD fixed points as a function of the heterogeneity in the agents' underlying Markov decision processes (MDPs); (ii) introducing a virtual MDP to closely approximate the dynamics of the federated TD algorithm; and (iii) using the virtual MDP to make explicit connections to federated optimization. Putting these pieces together, we rigorously prove that in a low-heterogeneity regime, exchanging model estimates leads to linear convergence speedups in the number of agents.

Paper Structure (54 sections, 24 theorems, 128 equations, 6 figures, 1 algorithm)

Key Result

Lemma 1

(Perturbation bound on Stationary Distributions) Suppose Assumption hetroP holds. Then, for any pair of agents $i,j \in [N]$, the stationary distributions $\pi^{(i)}$ and $\pi^{(j)}$ satisfy:

Figures (6)

  • Figure 1: (Left) Illustration of how FedTD(0) works. Each agent performs $K$ local TD update steps on its own MDP, and transmits its updated model to a server. The virtual MDP serves to approximate the dynamics of FedTD(0). The global model $\bar{\theta}_t$ at the server is used to construct a linearly parameterized approximation of the value function associated with a policy $\mu$. (Right) FedTD(0) helps each agent converge quickly to a ball $\mathcal{B}(\theta^*, \epsilon)$ centered around the optimal parameter $\theta^*$ of the virtual MDP. Here, $\epsilon$ captures the heterogeneity in the agents' MDPs. Using the output $\bar{\theta}_T$ of FedTD(0), each agent $i$ can then fine-tune based on its own data to converge exactly to its own optimal parameter $\theta^*_i$.
  • Figure 2: Performance of FedTD(0) under Markovian sampling. $(a)$ Performance of FedTD(0) for a varying number of agents $N$. The MDP $\mathcal{M}^{(1)}$ of the first agent is randomly generated with a state space of size $n=100$. The remaining MDPs are perturbations of $\mathcal{M}^{(1)}$ with heterogeneity levels $\epsilon = 0.05$ and $\epsilon_1=0.1$. We evaluate convergence in terms of the running error $e_t = \lVert \bar{\theta}_t - \theta_1^* \rVert^2$. $(b)$ Performance of FedTD(0) for varying heterogeneity levels, with a fixed number of agents $N=20$. Consistent with the theory, increasing $N$ reduces the error, and increasing the level of heterogeneity increases the size of the ball to which FedTD(0) converges. The number of local steps is $K=10$ in both plots.
  • Figure 3: Performance of FedTD(0) under i.i.d. sampling for a varying number of agents $N$. Solid lines denote the mean and shaded regions indicate the standard deviation over ten runs.
  • Figure 4: Performance of FedTD(0) under Markovian sampling for a varying number of agents $N$. Solid lines denote the mean and shaded regions indicate the standard deviation over ten runs.
  • Figure: Description of FedTD(0)
  • ...and 1 more figure
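
The loop described in the Figure 1 caption (each agent runs $K$ local TD steps on its own MDP, then sends its model to the server) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the server is assumed to take a plain average of the local models, the step size `alpha` is constant, each agent carries its Markov chain state across rounds, and the helpers `make_chain` and `perturb` (including the way heterogeneity is injected) are hypothetical. Only the Python standard library is used.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def make_chain(n, rng):
    """Toy policy-induced Markov chain: row-stochastic matrix P and rewards r."""
    P = []
    for _ in range(n):
        row = [rng.random() + 0.1 for _ in range(n)]
        total = sum(row)
        P.append([p / total for p in row])
    r = [rng.random() for _ in range(n)]
    return P, r


def perturb(P, r, eps, rng):
    """Hypothetical heterogeneity model (an assumption, not from the paper):
    mix each row of P with a random distribution and shift rewards by at most eps."""
    n = len(P)
    Q, _ = make_chain(n, rng)
    P_new = [[(1 - eps) * P[s][t] + eps * Q[s][t] for t in range(n)] for s in range(n)]
    r_new = [r[s] + eps * rng.uniform(-1.0, 1.0) for s in range(n)]
    return P_new, r_new


def fed_td0(envs, features, rounds, K, alpha, gamma, rng):
    """Sketch of federated TD(0) with linear function approximation.

    envs: list of (P, r) pairs, one per agent.
    features: features[s] is the feature vector of state s.
    Returns the server model after `rounds` communication rounds.
    """
    n, d = len(features), len(features[0])
    theta_bar = [0.0] * d
    states = [rng.randrange(n) for _ in envs]  # Markovian sampling: one trajectory per agent
    for _ in range(rounds):
        local_models = []
        for i, (P, r) in enumerate(envs):
            theta = list(theta_bar)  # each agent starts the round from the server model
            s = states[i]
            for _ in range(K):  # K local TD(0) steps on the agent's own chain
                s_next = rng.choices(range(n), weights=P[s])[0]
                td_error = r[s] + gamma * dot(features[s_next], theta) - dot(features[s], theta)
                theta = [w + alpha * td_error * f for w, f in zip(theta, features[s])]
                s = s_next
            states[i] = s
            local_models.append(theta)
        # Assumed aggregation rule: the server averages the local models.
        theta_bar = [sum(col) / len(local_models) for col in zip(*local_models)]
    return theta_bar


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, N = 10, 3, 5
    # Random features with norm at most 1 (illustrative choice).
    features = [[rng.uniform(-1, 1) / d ** 0.5 for _ in range(d)] for _ in range(n)]
    base = make_chain(n, rng)
    envs = [base] + [perturb(*base, eps=0.05, rng=rng) for _ in range(N - 1)]
    print(fed_td0(envs, features, rounds=200, K=10, alpha=0.05, gamma=0.9, rng=rng))
```

The sizes in the demo (`n = 10`, `N = 5`) are much smaller than the $n=100$, $N=20$ setup in the Figure 2 caption and are chosen only to keep the sketch fast. Setting `eps=0` in `perturb` gives identical environments, which is a convenient sanity check before adding heterogeneity.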

Theorems & Definitions (41)

  • Lemma 1
  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 2
  • Theorem 3
  • Lemma 2
  • Lemma 3
  • proof
  • ...and 31 more