Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices
Jiin Woo, Laixi Shi, Gauri Joshi, Yuejie Chi
TL;DR
This work tackles offline reinforcement learning in a federated setting, where multiple agents hold private offline datasets and a central server coordinates learning without data sharing. It introduces FedLCB-Q, a model-free Q-learning variant that integrates learning-rate rescaling, importance averaging, and a global pessimism penalty to manage local uncertainty and prevent overestimation in unseen state-action space. Theoretical guarantees show that FedLCB-Q achieves linear speedup in the number of agents and near-central sample complexity, with communication rounds scaling as $\widetilde{O}(H)$ under suitable synchronization. The results demonstrate that collaborative coverage across agents can approach the performance of centralized data processing, even when individual datasets are imperfect, provided that their collective coverage includes the optimal policy trajectories. This work thus offers a practical, communication-efficient paradigm for federated offline RL with provable performance guarantees.
Abstract
Offline reinforcement learning (RL), which seeks to learn an optimal policy using offline data, has garnered significant interest due to its potential in critical applications where online data collection is infeasible or expensive. This work explores the benefit of federated learning for offline RL, aiming at collaboratively leveraging offline datasets at multiple agents. Focusing on finite-horizon episodic tabular Markov decision processes (MDPs), we design FedLCB-Q, a variant of the popular model-free Q-learning algorithm tailored for federated offline RL. FedLCB-Q updates local Q-functions at agents with novel learning rate schedules and aggregates them at a central server using importance averaging and a carefully designed pessimistic penalty term. Our sample complexity analysis reveals that, with appropriately chosen parameters and synchronization schedules, FedLCB-Q achieves linear speedup in terms of the number of agents without requiring high-quality datasets at individual agents, as long as the local datasets collectively cover the state-action space visited by the optimal policy, highlighting the power of collaboration in the federated setting. In fact, the sample complexity almost matches that of the single-agent counterpart, as if all the data are stored at a central location, up to polynomial factors of the horizon length. Furthermore, FedLCB-Q is communication-efficient, where the number of communication rounds is only linear with respect to the horizon length up to logarithmic factors.
