Kernel-Based Distributed Q-Learning: A Scalable Reinforcement Learning Approach for Dynamic Treatment Regimes
Di Wang, Yao Wang, Shao-Bo Lin
TL;DR
This work tackles scalable dynamic treatment regimes (DTRs) by combining kernel ridge regression (KRR) with distributed learning to form DKRR-DTR, enabling offline reinforcement learning on large electronic health record (EHR) datasets. The authors present a stage-wise Q-learning formulation in a reproducing kernel Hilbert space (RKHS), use a divide-and-conquer scheme so that the cubic cost of KRR is paid on each data subset rather than on the full dataset, and introduce an integral-operator framework to obtain generalization bounds that do not depend on the input dimension. Theoretical results show a convergence rate of $O(|D|^{-r/(2r+s)})$ under mild regularity and capacity assumptions, where $|D|$ is the sample size; the horizon $T$ enters only through constants, not the rate itself. Empirically, KRR-DTR and its distributed variant DKRR-DTR achieve competitive or superior rewards compared with linear Q-learning and deep RL baselines while significantly reducing training time, especially on large-scale datasets. The combination of solid theory and practical efficiency makes DKRR-DTR a promising approach for safe, scalable medical decision support in the era of big EHR data.
Abstract
In recent years, large amounts of electronic health records (EHRs) concerning chronic diseases have been collected to facilitate medical diagnosis. Dynamic treatment regimes (DTRs) offer an efficient way to model the dynamic properties of such EHRs. While reinforcement learning (RL) is widely used to construct DTRs, developing RL algorithms that can effectively handle large amounts of data remains an area of ongoing research. In this paper, we present a scalable kernel-based distributed Q-learning algorithm for generating DTRs. We assess the proposed approach both theoretically and numerically. The results demonstrate that our algorithm significantly reduces the computational complexity associated with state-of-the-art deep reinforcement learning methods, while maintaining comparable generalization performance in terms of rewards accumulated across stages, such as survival time or cumulative survival probability.
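To make the kernel-based distributed Q-learning idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a Gaussian kernel, a finite action set, a regularization term scaled by the local sample size, size-weighted averaging of the local estimators, and Q-functions that depend only on the current state and action rather than on the full treatment history. All function and parameter names (`krr_fit`, `dkrr_fit`, `dkrr_q_learning`, `greedy_action`, `lam`, `gamma`, `m`) are hypothetical. With `m` subsets of equal size, each local solve costs on the order of $(|D|/m)^3$ operations instead of $|D|^3$.

```python
import numpy as np


def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A (n, d) and B (k, d)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def krr_fit(X, y, lam, gamma):
    """Kernel ridge regression on one data block; returns a predictor."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda Z: gaussian_kernel(Z, X, gamma) @ alpha


def dkrr_fit(X, y, m, lam, gamma, rng):
    """Divide-and-conquer KRR: fit on m disjoint subsets, then average.

    Assumption: local predictors are weighted by their subset size.
    """
    n = X.shape[0]
    parts = np.array_split(rng.permutation(n), m)
    local = [(len(p), krr_fit(X[p], y[p], lam, gamma)) for p in parts]
    return lambda Z: sum(k / n * f(Z) for k, f in local)


def with_action(S, a):
    """Append a constant action column to a state matrix."""
    return np.column_stack([S, np.full(S.shape[0], a)])


def dkrr_q_learning(states, actions, rewards, action_set, m,
                    lam=1e-3, gamma=1.0, seed=0):
    """Backward, stage-wise Q-learning with distributed KRR at every stage.

    states[t]: (n, d), actions[t]: (n,), rewards[t]: (n,) for t = 0..T-1.
    Returns a list of T fitted Q-functions acting on [state, action] rows.
    """
    rng = np.random.default_rng(seed)
    T = len(states)
    q_funcs = [None] * T
    next_value = np.zeros(states[0].shape[0])  # value after the last stage is 0
    for t in reversed(range(T)):
        X = np.column_stack([states[t], actions[t]])
        y = rewards[t] + next_value  # regression target for stage t
        q_funcs[t] = dkrr_fit(X, y, m, lam, gamma, rng)
        # Optimal value at stage t, used as the target term for stage t - 1.
        next_value = np.max(
            [q_funcs[t](with_action(states[t], a)) for a in action_set], axis=0
        )
    return q_funcs


def greedy_action(q, S, action_set):
    """Pick, for each state row, the action with the largest estimated Q."""
    vals = [q(with_action(S, a)) for a in action_set]
    return np.asarray(action_set)[np.argmax(vals, axis=0)]
```

In this sketch, calling `dkrr_q_learning` on per-stage arrays of states, actions, and rewards returns one Q-function per stage, and `greedy_action` turns a stage's Q-function into a treatment rule. Setting `m=1` recovers plain (non-distributed) kernel ridge regression at each stage.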
