Federated Offline Reinforcement Learning

Doudou Zhou; Yufeng Zhang; Aaron Sonabend-W; Zhaoran Wang; Junwei Lu; Tianxi Cai

TL;DR

The paper tackles learning offline dynamic treatment regimes from privacy-preserving, multi-site healthcare data by formulating a multi-site episodic linear MDP with homogeneous and site-specific effects. It introduces FDTR, a two-step federated policy optimization method that first learns locally with pessimism and then updates policies using one-shot summary statistics across $K$ sites and horizon $H$, achieving single-round communication. Theoretical guarantees are provided for suboptimality without requiring strong action coverage, with explicit rates and a non-parametric extension; pooling gains are realized for homogeneous components via the effective sample size $N=\sum_k n_k$. Empirical evidence from simulations and a sepsis study across multiple ICUs demonstrates FDTR’s competitive performance and its ability to recover interpretable site-aware policies. The work advances privacy-preserving, heterogeneity-aware federated offline RL with practical clinical implications.

Abstract

Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, which can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, privacy constraints prohibit sharing them. Moreover, the data are heterogeneous across sites. Federated offline RL algorithms are therefore both necessary and promising for addressing these problems. In this paper, we propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites. The proposed model makes it possible to analyze site-level features. We design the first federated policy optimization algorithm for offline RL with sample complexity guarantees. The proposed algorithm is communication-efficient: it requires only a single round of communication, in which sites exchange summary statistics. We give a theoretical guarantee for the proposed algorithm, showing that the suboptimality of the learned policies is comparable to the rate attained as if the data were not distributed. Extensive simulations demonstrate the effectiveness of the proposed algorithm. The method is applied to a multi-site sepsis dataset to illustrate its use in clinical settings.
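The single-round exchange of summary statistics can be illustrated with a much simpler problem than the one the paper solves. The sketch below is not FDTR: it is plain ridge regression with a shared coefficient vector, and the function names `local_summary` and `pooled_ridge`, the simulated data, and the choice `lam=1.0` are all assumptions made for illustration. It shows only that when each site sends its Gram matrix and moment vector once, the server recovers the same estimate it would obtain from the pooled raw data, with effective sample size $N=\sum_k n_k$.

```python
import numpy as np

def local_summary(X, y):
    """Site side: compute least-squares summary statistics; raw rows never leave the site."""
    return X.T @ X, X.T @ y, X.shape[0]

def pooled_ridge(summaries, lam=1.0):
    """Server side: combine one message per site into a ridge estimate."""
    d = summaries[0][0].shape[0]
    gram = lam * np.eye(d)      # ridge regularizer
    moment = np.zeros(d)
    n_total = 0
    for G, m, n in summaries:
        gram += G
        moment += m
        n_total += n
    return np.linalg.solve(gram, moment), n_total

# Simulated sites sharing one (hypothetical) homogeneous coefficient vector.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
sites = []
for n_k in (50, 80, 120):
    X = rng.normal(size=(n_k, 3))
    y = X @ theta_true + 0.1 * rng.normal(size=n_k)
    sites.append((X, y))

# One round of communication: each site sends (X^T X, X^T y, n_k).
summaries = [local_summary(X, y) for X, y in sites]
theta_fed, N = pooled_ridge(summaries, lam=1.0)

# Reference: the estimate a server holding all raw data would compute.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
theta_central = np.linalg.solve(X_all.T @ X_all + np.eye(3), X_all.T @ y_all)
```

In this toy setting the federated and centralized estimates coincide up to floating-point error, which is the sense in which pooling through summary statistics loses nothing for a shared linear component. The paper's setting additionally involves site-specific effects and pessimism, which this sketch does not attempt to model.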

Paper Structure (12 sections, 7 theorems, 19 equations, 3 figures, 2 algorithms)

Introduction
Overview of the Proposed Model and Algorithm
Related Work
Our Contributions
Multi-site MDP Model
Federated Dynamic Treatment Regimes Algorithm
Theoretical Analysis
Extension to Non-parametric Estimation
Experiments
Simulations
FDTR for Sepsis Treatment Across Intensive Care Units
Discussion

Key Result

Theorem 1

In Algorithm alg0, we set $\lambda=1$, $\alpha_k = c dH\sqrt{\zeta_k}$, where $\zeta_k = \log(2dH n_k/\xi)$, $c>0$ is an absolute constant and $\xi \in (0,1)$ is the confidence parameter. Then $\{\widetilde{\Gamma}^k_h\}_{h=1}^H$ in Algorithm alg0 is a $\xi$-multi-site confidence bound of …
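The tuning quantities in the theorem statement, $\zeta_k = \log(2dH n_k/\xi)$ and $\alpha_k = c\,dH\sqrt{\zeta_k}$, can be evaluated directly. The following is a minimal sketch; the function names are hypothetical, and because the theorem only says that $c>0$ is an absolute constant, the default `c=1.0` is a placeholder assumption rather than a value taken from the paper.

```python
import math

def zeta(d, H, n_k, xi):
    """zeta_k = log(2 * d * H * n_k / xi), with xi in (0, 1) the confidence parameter."""
    if not 0.0 < xi < 1.0:
        raise ValueError("xi must lie strictly between 0 and 1")
    return math.log(2 * d * H * n_k / xi)

def alpha(d, H, n_k, xi, c=1.0):
    """alpha_k = c * d * H * sqrt(zeta_k); c is an unspecified absolute constant (placeholder)."""
    return c * d * H * math.sqrt(zeta(d, H, n_k, xi))

# Example with illustrative values: the scaling grows only logarithmically
# in the local sample size n_k, but linearly in d and H.
alpha_small = alpha(d=5, H=10, n_k=100, xi=0.05)
alpha_large = alpha(d=5, H=10, n_k=1000, xi=0.05)
```

Because $n_k$ enters only through $\sqrt{\log(\cdot)}$, a tenfold increase in the local sample size changes $\alpha_k$ very little, whereas doubling $d$ or $H$ roughly doubles it.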

Figures (3)

Figure 1: An illustration of FDTR.
Figure 2: Mean value function for FDTR and benchmarks trained on $K=5$ sites for panels (a) and (c), and $K=10$ sites for panel (b). We show the value function averaged over the $K$ sites for increasing sample size. Error bars show $95\%$ CI. Here $(d,|\mathcal{A}|,H)$ denote, respectively, the dimension of the state space, the cardinality of the action space, and the episode length.
Figure 3: (a) Value function estimates on the held-out sepsis data across $9$ ICUs. (b) Homogeneous coefficients estimated by FDTR at different time points.

Theorems & Definitions (13)

Definition 1: Multi-site confidence bound
Remark 1
Theorem 1: Suboptimality of the Preliminary Estimators
Theorem 2: Suboptimality of FDTR
Remark 2
Theorem 3: Estimation error for the homogeneous effects
Corollary 1: Suboptimality of FDTR with Well-Explored Dataset
Remark 3
Remark 4
Remark 5
...and 3 more
