Federated Offline Reinforcement Learning
Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W, Zhaoran Wang, Junwei Lu, Tianxi Cai
TL;DR
The paper tackles learning offline dynamic treatment regimes from privacy-preserving, multi-site healthcare data by formulating a multi-site episodic linear MDP with homogeneous and site-specific effects. It introduces FDTR, a two-step federated policy optimization method that first learns locally with pessimism and then updates policies using one-shot summary statistics across $K$ sites and horizon $H$, achieving single-round communication. Theoretical guarantees are provided for suboptimality without requiring strong action coverage, with explicit rates and a non-parametric extension; pooling gains are realized for homogeneous components via the effective sample size $N=\sum_k n_k$. Empirical evidence from simulations and a sepsis study across multiple ICUs demonstrates FDTR’s competitive performance and its ability to recover interpretable site-aware policies. The work advances privacy-preserving, heterogeneity-aware federated offline RL with practical clinical implications.
Abstract
Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, and they can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, privacy constraints prohibit sharing them. Moreover, the sites are heterogeneous. Federated offline RL algorithms are therefore both necessary and promising for addressing these problems. In this paper, we propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites. The proposed model makes it possible to analyze site-level features. We design the first federated policy optimization algorithm for offline RL with a sample complexity guarantee. The proposed algorithm is communication-efficient: it requires only a single round of communication, in which sites exchange summary statistics. We give a theoretical guarantee for the proposed algorithm, showing that the suboptimality of the learned policies is comparable to the rate that would be attained if the data were not distributed. Extensive simulations demonstrate the effectiveness of the proposed algorithm. The method is applied to a multi-site sepsis dataset to illustrate its use in clinical settings.
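The single-round exchange of summary statistics can be illustrated with a toy sketch. This is not the FDTR algorithm: it omits pessimism, the horizon $H$, and site-specific effects, and only shows how a shared linear coefficient could be estimated by ridge regression when each of $K$ sites sends its Gram matrix, moment vector, and sample count once. All function names, the simulated data, and the ridge penalty `lam` are assumptions made for illustration.

```python
import random

def site_summary(features, targets):
    """Summaries one site would share: Gram matrix, moment vector, sample count."""
    d = len(features[0])
    gram = [[0.0] * d for _ in range(d)]
    moment = [0.0] * d
    for phi, y in zip(features, targets):
        for i in range(d):
            moment[i] += phi[i] * y
            for j in range(d):
                gram[i][j] += phi[i] * phi[j]
    return gram, moment, len(features)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def ridge_from_summaries(summaries, lam=1.0):
    """Aggregate site summaries once and solve the pooled ridge problem."""
    d = len(summaries[0][1])
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    moment = [0.0] * d
    total = 0
    for g, b, n in summaries:
        total += n  # effective sample size N = sum_k n_k
        for i in range(d):
            moment[i] += b[i]
            for j in range(d):
                gram[i][j] += g[i][j]
    return solve(gram, moment), total

# Simulated data (hypothetical): three sites sharing one coefficient vector.
rng = random.Random(0)
true_w = [1.0, -2.0]
sites = []
for n_k in (30, 50, 20):
    feats = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_k)]
    targs = [sum(w * x for w, x in zip(true_w, phi)) + rng.gauss(0, 0.1)
             for phi in feats]
    sites.append((feats, targs))

# Federated estimate: each site sends its summary exactly once.
summaries = [site_summary(f, t) for f, t in sites]
w_fed, N = ridge_from_summaries(summaries)

# Reference estimate computed as if all raw data were pooled in one place.
pooled_f = [phi for f, _ in sites for phi in f]
pooled_t = [y for _, t in sites for y in t]
w_pooled, _ = ridge_from_summaries([site_summary(pooled_f, pooled_t)])
```

In this toy setting `w_fed` and `w_pooled` agree up to floating-point error, because the pooled Gram matrix and moment vector are sums of the per-site ones. That additivity is why no raw patient-level records need to leave a site here.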
