Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices

Jiin Woo; Laixi Shi; Gauri Joshi; Yuejie Chi

TL;DR

This work studies offline reinforcement learning in a federated setting, where multiple agents hold private offline datasets and a central server coordinates learning without the agents sharing raw data. It introduces FedLCB-Q, a model-free Q-learning variant that combines rescaled local learning rates, importance averaging at the server, and a global pessimism penalty to account for uncertainty in the local estimates and to prevent overestimation on state-action pairs that the data cover poorly. The analysis shows that FedLCB-Q achieves linear speedup in the number of agents and a sample complexity that nearly matches the centralized single-agent setting, up to polynomial factors of the horizon length $H$, while the number of communication rounds scales as $\widetilde{O}(H)$ under a suitable synchronization schedule. These guarantees hold even when individual datasets are of low quality, provided that the local datasets collectively cover the state-action pairs visited by the optimal policy. FedLCB-Q is thus a communication-efficient approach to federated offline RL with provable performance guarantees.

Abstract

Offline reinforcement learning (RL), which seeks to learn an optimal policy using offline data, has garnered significant interest due to its potential in critical applications where online data collection is infeasible or expensive. This work explores the benefit of federated learning for offline RL, aiming at collaboratively leveraging offline datasets at multiple agents. Focusing on finite-horizon episodic tabular Markov decision processes (MDPs), we design FedLCB-Q, a variant of the popular model-free Q-learning algorithm tailored for federated offline RL. FedLCB-Q updates local Q-functions at agents with novel learning rate schedules and aggregates them at a central server using importance averaging and a carefully designed pessimistic penalty term. Our sample complexity analysis reveals that, with appropriately chosen parameters and synchronization schedules, FedLCB-Q achieves linear speedup in terms of the number of agents without requiring high-quality datasets at individual agents, as long as the local datasets collectively cover the state-action space visited by the optimal policy, highlighting the power of collaboration in the federated setting. In fact, the sample complexity almost matches that of the single-agent counterpart, as if all the data are stored at a central location, up to polynomial factors of the horizon length. Furthermore, FedLCB-Q is communication-efficient, where the number of communication rounds is only linear with respect to the horizon length up to logarithmic factors.
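The server-side step described in the abstract (importance averaging followed by a pessimistic penalty) can be illustrated for a single $(s,a,h)$ entry with the minimal sketch below. It is not the paper's aggregation rule: the function name `aggregate_entry`, the count-proportional weights, the penalty form `c_b * sqrt(H^2 / N)`, and the constant `c_b` are all assumptions chosen for illustration only.

```python
import math

def aggregate_entry(local_q, local_counts, horizon, c_b=1.0):
    """Hypothetical server update for one (s, a, h) entry.

    local_q      : local Q estimates, one per agent
    local_counts : how often each agent's dataset visited the entry
    horizon      : episode length H
    c_b          : penalty constant (an assumption, not from the paper)
    """
    total = sum(local_counts)
    if total == 0:
        # No agent has data for this entry: keep the most pessimistic value.
        return 0.0
    # Assumed weighting: agents that saw the entry more often count more.
    avg = sum(n * q for q, n in zip(local_q, local_counts)) / total
    # Assumed penalty: shrinks as the combined visit count grows.
    penalty = c_b * math.sqrt(horizon ** 2 / total)
    # Clip at zero so the pessimistic estimate stays a valid lower bound.
    return max(avg - penalty, 0.0)

# Agent 0 never visited the entry; agent 1 visited it 100 times.
print(aggregate_entry([0.0, 3.0], [0, 100], horizon=5))
```

In this toy version the penalty depends on the combined visit count rather than on any single agent's count, which mirrors the abstract's point that an entry covered well by at least one agent is enough for the aggregate.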

Paper Structure

This paper contains 66 sections, 11 theorems, 120 equations, 3 figures, 1 table, and 3 algorithms.

Introduction
Federated offline RL.
Our contribution
Related work
Offline RL.
Federated RL.
Q-learning.
Notation.
Background and problem formulation
Background
Basics of episodic finite-horizon MDPs.
Bellman equations.
Problem formulation: federated offline RL
Goal.
Metric.
...and 51 more sections

Key Result

Theorem 1

Consider $\delta \in (0,1)$ and let $\widehat{\pi}$ be the solution policy of FedLCB-Q. If a synchronization schedule ${\mathcal{T}}(K)$ is independent of trajectories in datasets $\mathcal{D}$ and satisfies for any $u\ge1$, where $\tau_u$ is the number of episodes between the $(u-1)$-th and the $u$-th aggregations. Denoting the total number of samples per agent $T=KH$, the following holds: at

Figures (3)

Figure 1: FedLCB-Q with $M$ agents and a central server. Each agent $m$ performs a local update on its local Q-table $Q_k^m$ for each episode $k$ in its local history dataset $\mathcal{D}^m$. When synchronization is scheduled at $k \in {\mathcal{T}}(K)$, the agents send their local Q-tables to the server, which aggregates them into a global Q-table and synchronizes the local Q-tables with it.
Figure 2: Illustration of the periodic synchronization with constant period $\tau$ and the exponential synchronization with a rate $\gamma$.
Figure 3: Illustration of the rescaled learning rates ($\eta_{i,h}^m(s,a)$) and the episode weights ($\omega_{i,60,h}^m(s,a)$) induced by the learning rates of two agents $m=0,1$ for episodes $1\le i \le 60$, where $H=5$, the occupancy distribution of each agent on $(s,a,h)\in {\mathcal{S}} \times \mathcal{A} \times [5]$ is $d_h^0(s,a) = 0.7$ and $d_h^1(s,a) = 0.3$, respectively, and the synchronization schedule is ${\mathcal{T}}(60)= \{ 10,30, 60 \}$.
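The synchronization schedules named in Figures 2 and 3 can be made concrete with a short sketch. The helper names are hypothetical, and two details are assumptions rather than the paper's definitions: the final episode $K$ is always included as an aggregation point, and the exponential schedule grows each gap by a factor of $(1+\gamma)$. The `gaps` helper computes $\tau_u$, the number of episodes between consecutive aggregations, as described in Theorem 1.

```python
import math

def periodic_schedule(K, tau):
    """Aggregate every `tau` episodes; the last episode K is always included (assumption)."""
    rounds = list(range(tau, K + 1, tau))
    if not rounds or rounds[-1] != K:
        rounds.append(K)
    return rounds

def exponential_schedule(K, tau_first, gamma):
    """Assumed form: each gap is roughly (1 + gamma) times the previous one."""
    rounds, k, gap = [], 0, tau_first
    while k + gap < K:
        k += gap
        rounds.append(k)
        gap = max(gap, math.ceil((1 + gamma) * gap))
    rounds.append(K)  # final aggregation at the last episode (assumption)
    return rounds

def gaps(schedule):
    """tau_u: number of episodes between the (u-1)-th and u-th aggregations."""
    return [b - a for a, b in zip([0] + schedule[:-1], schedule)]

print(periodic_schedule(60, 20))
print(exponential_schedule(60, 5, 1.0))
print(gaps([10, 30, 60]))  # the schedule T(60) used in Figure 3
```

For the Figure 3 schedule ${\mathcal{T}}(60)=\{10,30,60\}$, `gaps` returns the per-round lengths 10, 20, and 30.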

Theorems & Definitions (15)

Definition 1: single-policy clipped concentrability
Definition 2: average single-policy clipped concentrability
Theorem 1
Corollary 1
Lemma 1: Q-estimation error decomposition
Lemma 2: Concentration bound on the visitation counters
Lemma 3: Pessimistic global value
Theorem 2: li2021syncq
Lemma 6
proof
...and 5 more
