On Global Convergence Rates for Federated Policy Gradient under Heterogeneous Environment

Safwan Labbi, Paul Mangold, Daniil Tiapkin, Eric Moulines

TL;DR

This paper addresses federated reinforcement learning in environments where agents experience heterogeneous transitions. It proves that global convergence can be achieved for policy-gradient methods under local Łojasiewicz-type conditions, and shows that entropy regularization yields linear convergence with a linear speedup in the number of agents. To tackle large action spaces and heterogeneity, it introduces a softmax-based FedPG family and a novel bit-level parameterization (b-RS-FedPG) with tailored regularization, deriving explicit convergence rates to near-optimal stationary policies. Empirical results on heterogeneous FRL benchmarks demonstrate superior performance of FedPG and b-RS-FedPG compared to federated Q-learning, highlighting practical impact for privacy-preserving, communication-efficient multi-agent learning. Future work points toward achieving exact optimal convergence in heterogeneous FRL and extending bit-level ideas to broader action spaces.

Abstract

Ensuring convergence of policy gradient methods in federated reinforcement learning (FRL) under environment heterogeneity remains a major challenge. In this work, we first establish that heterogeneity, perhaps counter-intuitively, can force optimal policies to be non-deterministic or even time-varying, even in tabular environments. Subsequently, we prove global convergence results for federated policy gradient (FedPG) algorithms employing local updates, under a Łojasiewicz condition that holds only for each individual agent, in both entropy-regularized and non-regularized scenarios. Crucially, our theoretical analysis shows that FedPG attains linear speed-up with respect to the number of agents, a property central to efficient federated learning. Leveraging insights from our theoretical findings, we introduce b-RS-FedPG, a novel policy gradient method that employs a carefully constructed softmax-inspired parameterization coupled with an appropriate regularization scheme. We further derive explicit convergence rates for b-RS-FedPG toward near-optimal stationary policies. Finally, we show empirically that both FedPG and b-RS-FedPG consistently outperform federated Q-learning in heterogeneous settings.
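
To make the setting in the abstract concrete, the following is a minimal sketch of federated policy gradient with local updates on a tabular softmax policy. It is not the paper's algorithm: it assumes exact gradients instead of sampled trajectories, uses no regularization, and does not implement b-RS-FedPG. The function names (`value_and_grad`, `fedpg`, `make_instance`), the random instance, and all hyperparameters are hypothetical choices for illustration.

```python
import numpy as np


def softmax_policy(theta):
    """Row-wise softmax: theta has shape (S, A), returns pi(a|s)."""
    z = theta - theta.max(axis=1, keepdims=True)  # stabilise the exponent
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def value_and_grad(theta, P, R, rho, gamma):
    """Exact discounted return and softmax policy gradient for one agent.

    P: (S, A, S) transition kernel, R: (S, A) rewards, rho: (S,) initial law.
    """
    S = R.shape[0]
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)       # state kernel under pi
    r_pi = (pi * R).sum(axis=1)                 # expected reward per state
    lhs = np.eye(S) - gamma * P_pi
    V = np.linalg.solve(lhs, r_pi)              # Bellman equation for V
    Q = R + gamma * (P @ V)                     # shape (S, A)
    d = np.linalg.solve(lhs.T, rho)             # unnormalised discounted occupancy
    grad = d[:, None] * pi * (Q - V[:, None])   # d(s) * pi(a|s) * advantage
    return float(rho @ V), grad


def make_instance(n_agents=4, S=4, A=3, hetero=0.2, seed=0):
    """Assumed toy FRL instance: shared rewards, perturbed transition kernels."""
    rng = np.random.default_rng(seed)
    base = rng.dirichlet(np.ones(S), size=(S, A))
    kernels = [
        (1 - hetero) * base + hetero * rng.dirichlet(np.ones(S), size=(S, A))
        for _ in range(n_agents)
    ]
    R = rng.uniform(size=(S, A))
    rho = np.full(S, 1.0 / S)
    return kernels, R, rho


def global_objective(theta, kernels, R, rho, gamma):
    """Average of the agents' returns under the shared policy."""
    return float(np.mean([value_and_grad(theta, P, R, rho, gamma)[0] for P in kernels]))


def fedpg(kernels, R, rho, gamma=0.9, rounds=100, local_steps=5, lr=0.2):
    """Each round: every agent runs local ascent steps, the server averages."""
    theta = np.zeros_like(R)
    history = [global_objective(theta, kernels, R, rho, gamma)]
    for _ in range(rounds):
        local_thetas = []
        for P in kernels:
            th = theta.copy()
            for _ in range(local_steps):
                _, g = value_and_grad(th, P, R, rho, gamma)
                th = th + lr * g                # local gradient ascent step
            local_thetas.append(th)
        theta = np.mean(local_thetas, axis=0)   # one communication round
        history.append(global_objective(theta, kernels, R, rho, gamma))
    return theta, history
```

Calling `theta, history = fedpg(*make_instance())` returns the averaged parameters and the global objective recorded after each communication round. Raising `hetero` in `make_instance` makes the agents' kernels differ more, which is the regime the paper's analysis targets.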

Paper Structure

This paper contains 53 sections, 60 theorems, 322 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.1

For each of the following properties, there exists an FRL instance with two infinite-horizon discounted MDPs that satisfy

Figures (7)

  • Figure 1: Comparison of three FedPG variants (crosses, circles, triangles) and Fed-Q-learning (squares): (a) Value of the global objective $J(\theta^r)$ in the Synthetic environment, for the three FedPG variants and different numbers of agents $M \in \{2, 10, 50\}$, shown on a log-log scale; (b) Value of $J(\theta^r)$ in the Synthetic environment, comparing all four algorithms; (c) Value of $J(\theta^r)$ in the GridWorld environment, comparing all four algorithms.
  • Figure 2: FRL task with no optimal local history-dependent policy. The triplet means (action, probability, reward) and $\gamma = 0.9$. Note that these two environments share the same action space, same state space, and same reward function.
  • Figure 3: FRL task with no optimal stationary policy. The triplet means (action, probability, reward) and $\gamma = 0.9$. If the action is not specified, it means that all the actions give the same reward and lead to the same state.
  • Figure 4: FRL task with no optimal local deterministic policy. The triplet means (action, probability, reward) and $\gamma = 0.9$. If the action is not specified, it means that all the actions give the same reward and lead to the same state.
  • Figure 5: FRL task with no optimal local deterministic policy. The triplet means (action, probability, reward), $\gamma = 0.999$, and $\lambda = 1$. If the action is not specified, it means that all the actions give the same reward and lead to the same state.
  • ...and 2 more figures

Theorems & Definitions (102)

  • Theorem 3.1
  • Lemma 4.1: Ascent Lemma
  • Theorem 4.2: Convergence rates of FedPG
  • Corollary 4.3: Sample and Communication Complexity of FedPG
  • Corollary 4.4: Sample and Communication Complexity of FedPG
  • Corollary 4.5: Sample and Communication Complexity of FedPG
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • ...and 92 more