Federated Q-Learning: Linear Regret Speedup with Low Communication Cost

Zhong Zheng; Fengyu Gao; Lingzhou Xue; Jing Yang

TL;DR

This paper studies federated reinforcement learning for tabular episodic MDPs in which agents do not share their raw data. It proposes two model-free algorithms, FedQ-Hoeffding and FedQ-Bernstein, that achieve a linear regret speedup in the number of agents $M$ while incurring a communication cost that is logarithmic in the total number of time steps $T$. The methods feature event-triggered rounds, adaptive exploration policies, and equal-weight aggregation, coupled with novel concentration bounds for sums of non-martingale differences; the Bernstein variant further tightens the regret by a factor of $\sqrt{H}$. The regret bounds are $\tilde{O}(\sqrt{H^4 S A M T})$ for FedQ-Hoeffding and $\tilde{O}(\sqrt{H^3 S A M T})$ for FedQ-Bernstein, both with $O(M^2 H^4 S^2 A \log(T/M))$ communication cost. The results show that model-free federated RL can attain a linear regret speedup with low communication overhead, and the techniques may extend to other federated RL frameworks.
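
A short calculation shows why a bound of this form amounts to a linear speedup. It rests on two assumptions not spelled out in this summary: that $T$ counts steps per agent, so the $M$ agents collect $MT$ samples in total, and that the single-agent baseline is the known UCB-Hoeffding Q-learning rate $\tilde{O}(\sqrt{H^4 S A T})$.

$$
\frac{1}{M}\,\tilde{O}\big(\sqrt{H^4 S A M T}\big) \;=\; \tilde{O}\big(\sqrt{H^4 S A T / M}\big)
\qquad\text{versus}\qquad
\tilde{O}\big(\sqrt{H^4 S A T}\big) \ \text{for one agent learning alone.}
$$

Under these assumptions, the average regret per agent shrinks by a factor of $\sqrt{M}$, and the total regret matches what a single agent would incur if it had collected all $MT$ samples itself.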

Abstract

In this paper, we consider federated reinforcement learning for tabular episodic Markov Decision Processes (MDPs) where, under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. While linear speedup in the number of agents has been achieved in similar settings for some metrics, such as convergence rate and sample complexity, it is unclear whether a model-free algorithm can achieve linear regret speedup with low communication cost. We propose two federated Q-learning algorithms, termed FedQ-Hoeffding and FedQ-Bernstein, and show that their total regrets achieve a linear speedup over their single-agent counterparts when the time horizon is sufficiently large, while the communication cost scales logarithmically in the total number of time steps $T$. These results rely on an event-triggered synchronization mechanism between the agents and the server, a novel step size selection when the server aggregates the local estimates of the state-action values to form the global estimates, and a set of new concentration inequalities to bound the sum of non-martingale differences. This is the first work showing that linear regret speedup and logarithmic communication cost can be achieved by model-free algorithms in federated reinforcement learning.
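
The two mechanisms named above, event-triggered synchronization and server-side aggregation of local estimates, can be illustrated with a deliberately simplified sketch. The cap rule in `visit_cap`, the array shapes, and all function names are assumptions made for illustration; this excerpt does not give the paper's actual trigger condition or step size rule.

```python
import numpy as np


def visit_cap(global_count: int, num_agents: int, horizon: int) -> int:
    """Hypothetical per-agent visit cap for one (step, state, action) triple.

    Assumption: the cap grows in proportion to the global visit count,
    scaled by 1 / (M * H * (H + 1)), and is never below 1.
    """
    return max(1, global_count // (num_agents * horizon * (horizon + 1)))


def trigger_fired(local_counts: np.ndarray, caps: np.ndarray) -> bool:
    """Return True once any agent reaches the cap on any triple.

    local_counts has shape (M, n_triples); caps has shape (n_triples,).
    """
    return bool(np.any(local_counts >= caps))


def aggregate_equal_weight(local_estimates: np.ndarray) -> np.ndarray:
    """Average local estimates over agents with equal weights.

    local_estimates has shape (M, n_triples); the result has shape (n_triples,).
    This is a plain mean, used here as a stand-in for the server update.
    """
    return local_estimates.mean(axis=0)


# Toy usage: 2 agents, 2 triples.
caps = np.array([2, 3])
counts = np.array([[0, 1], [2, 0]])
if trigger_fired(counts, caps):
    q_global = aggregate_equal_weight(np.array([[1.0, 3.0], [3.0, 5.0]]))
```

Because the assumed cap grows with the global count, rounds become longer as data accumulates, which is the intuition behind a communication cost that grows only logarithmically in $T$.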

Paper Structure (24 sections, 17 theorems, 200 equations, 2 figures, 1 table, 4 algorithms)

Introduction
Background and Problem Formulation
Preliminaries
The Federated RL Framework
Algorithm Design
The FedQ-Hoeffding Algorithm
Intuition behind the Algorithm Design
Performance Guarantees
Extension to Bernstein-type Algorithm
Conclusion
Related Works
Auxiliary Lemmas
Proof of Theorem 4.1
Robustness against Asynchronization
Bounds on $Q_h^k - Q_h^\star$
...and 9 more sections

Key Result

Theorem 4.1

Let $\tilde{C} = 1/(H(H+1))$ and $\iota = \max\{\iota_0,\iota_1\}$, where $\iota_0 = \log\frac{2SA(T_0+HM)(1+\tilde{C})}{p}$, $\iota_1 = \log\frac{2K_0SAH(T_0/H + M)(1+\tilde{C})}{p}$, and $p\in(0,1)$. Define $b_t = c\sqrt{H^3\iota/t}$. Under the server and agent algorithms of FedQ-Hoeffding, there exists a positive constant $c$ for which the regret bound holds, where $T = H\sum_{k=1}^K n^k$ is the total number of steps in the first $K$ rounds.
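
The quantities in this statement are straightforward to evaluate numerically. The sketch below computes $\iota$ and the bonus $b_t$ directly from the formulas above; the function names are illustrative, $T_0$ and $K_0$ are treated as given inputs because this excerpt does not define them, and the default `c=1.0` is a placeholder rather than the constant from the theorem.

```python
import math


def log_factor(S: int, A: int, H: int, M: int, T0: float, K0: float, p: float) -> float:
    """Compute iota = max(iota_0, iota_1) from the theorem statement."""
    c_tilde = 1.0 / (H * (H + 1))
    iota0 = math.log(2 * S * A * (T0 + H * M) * (1 + c_tilde) / p)
    iota1 = math.log(2 * K0 * S * A * H * (T0 / H + M) * (1 + c_tilde) / p)
    return max(iota0, iota1)


def hoeffding_bonus(t: int, H: int, iota: float, c: float = 1.0) -> float:
    """Compute b_t = c * sqrt(H^3 * iota / t) for a visit count t >= 1."""
    if t < 1:
        raise ValueError("t must be at least 1")
    return c * math.sqrt(H**3 * iota / t)


# Example with arbitrary illustrative values.
iota = log_factor(S=5, A=3, H=4, M=10, T0=1000.0, K0=50.0, p=0.05)
bonuses = [hoeffding_bonus(t, H=4, iota=iota) for t in (1, 10, 100)]
```

Since $H(T_0/H + M) = T_0 + HM$, the two terms satisfy $\iota_1 = \iota_0 + \log K_0$, so the maximum is $\iota_1$ whenever $K_0 \ge 1$. The bonus decays as $1/\sqrt{t}$ in the visit count $t$.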

Figures (2)

Figure 1: Regret comparison.
Figure 2: Total number of communication rounds as a function of $T/H$.

Theorems & Definitions (35)

Theorem 4.1: Regret Upper Bound for FedQ-Hoeffding
proof: Proof Sketch of Theorem 4.1
Theorem 4.2: Communication Cost
proof: Proof Sketch of Theorem 4.2
Theorem 5.1: Regret Upper Bound for FedQ-Bernstein
Lemma B.1
proof: Proof of Lemma B.1
Lemma B.2
proof: Proof of Lemma B.2
Lemma B.3
...and 25 more
