On Global Convergence Rates for Federated Policy Gradient under Heterogeneous Environment
Safwan Labbi, Paul Mangold, Daniil Tiapkin, Eric Moulines
TL;DR
This paper addresses federated reinforcement learning (FRL) in which agents face heterogeneous transition dynamics. It proves global convergence of policy-gradient methods under Łojasiewicz-type conditions that hold only locally, for each individual agent, and shows that entropy regularization yields linear convergence with a linear speed-up in the number of agents. To tackle large action spaces and heterogeneity, it introduces a softmax-based FedPG family and a novel bit-level parameterization (b-RS-FedPG) with tailored regularization, deriving explicit convergence rates to near-optimal stationary policies. Experiments on heterogeneous FRL benchmarks show that FedPG and b-RS-FedPG outperform federated Q-learning, which is relevant to privacy-preserving, communication-efficient multi-agent learning. Future work points toward achieving exact optimal convergence in heterogeneous FRL and extending bit-level ideas to broader action spaces.
Abstract
Ensuring convergence of policy gradient methods in federated reinforcement learning (FRL) under environment heterogeneity remains a major challenge. In this work, we first establish that heterogeneity, perhaps counter-intuitively, can force optimal policies to be non-deterministic or even time-varying, even in tabular environments. We then prove global convergence results for federated policy gradient (FedPG) algorithms employing local updates, under a Łojasiewicz condition that holds only for each individual agent, in both entropy-regularized and non-regularized scenarios. Crucially, our theoretical analysis shows that FedPG attains linear speed-up with respect to the number of agents, a property central to efficient federated learning. Leveraging insights from our theoretical findings, we introduce b-RS-FedPG, a novel policy gradient method that employs a carefully constructed softmax-inspired parameterization coupled with an appropriate regularization scheme. We further derive explicit convergence rates for b-RS-FedPG toward near-optimal stationary policies. Finally, we show empirically that both FedPG and b-RS-FedPG consistently outperform federated Q-learning in heterogeneous settings.
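To make the setting concrete, the following is a minimal sketch of federated policy gradient with local updates, written under several simplifying assumptions that are not taken from the paper: a small random tabular MDP, a shared reward with agent-specific transition kernels, exact (not sampled) gradients, a plain softmax parameterization, and no entropy regularization. It does not implement b-RS-FedPG. All names (`fed_pg`, `value_and_gradient`, `average_return`) and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax_policy(theta):
    # Row-wise softmax over actions, shifted for numerical stability.
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def value_and_gradient(theta, P, r, rho, gamma):
    """Exact discounted return and its gradient for one agent.

    P: (S, A, S) transition kernel, r: (S, A) rewards, rho: (S,) initial law.
    """
    S, _ = r.shape
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)        # state kernel under pi
    r_pi = (pi * r).sum(axis=1)                  # expected reward per state
    M = np.eye(S) - gamma * P_pi
    V = np.linalg.solve(M, r_pi)                 # value function
    Q = r + gamma * (P @ V)                      # action values, shape (S, A)
    adv = Q - V[:, None]                         # advantage
    d = (1 - gamma) * np.linalg.solve(M.T, rho)  # discounted state visitation
    grad = d[:, None] * pi * adv / (1 - gamma)   # softmax policy gradient
    return rho @ V, grad

def fed_pg(P_agents, r, rho, gamma=0.9, rounds=50, local_steps=5, lr=0.1):
    # Each round: every agent runs a few local ascent steps from the shared
    # parameters on its own environment, then the server averages the results.
    theta = np.zeros(r.shape)
    for _ in range(rounds):
        local_thetas = []
        for P in P_agents:
            th = theta.copy()
            for _ in range(local_steps):
                _, g = value_and_gradient(th, P, r, rho, gamma)
                th += lr * g
            local_thetas.append(th)
        theta = np.mean(local_thetas, axis=0)
    return theta

def average_return(theta, P_agents, r, rho, gamma):
    # Federated objective: mean return of one shared policy across agents.
    vals = [value_and_gradient(theta, P, r, rho, gamma)[0] for P in P_agents]
    return float(np.mean(vals))

# Illustrative heterogeneous setup (assumed, not from the paper):
# a common base kernel perturbed independently for each agent.
rng = np.random.default_rng(0)
S, A, n_agents, gamma = 4, 3, 5, 0.9
P_base = rng.dirichlet(np.ones(S), size=(S, A))
P_agents = [0.8 * P_base + 0.2 * rng.dirichlet(np.ones(S), size=(S, A))
            for _ in range(n_agents)]
r = rng.uniform(size=(S, A))
rho = np.full(S, 1.0 / S)

J_init = average_return(np.zeros((S, A)), P_agents, r, rho, gamma)
theta = fed_pg(P_agents, r, rho, gamma=gamma)
J_final = average_return(theta, P_agents, r, rho, gamma)
print(f"average return: {J_init:.3f} -> {J_final:.3f}")
```

In this sketch the agents only exchange policy parameters, once per round, so the number of local steps controls the trade-off between communication and the drift induced by heterogeneous transitions.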
