Generalized Fitted Q-Iteration with Clustered Data

Liyuan Hu, Jitao Wang, Zhenke Wu, Chengchun Shi

TL;DR

This paper addresses reinforcement learning with cluster-structured data and proposes generalized fitted Q-iteration (GFQI), which integrates generalized estimating equations to handle intra-cluster correlations. The method comes with theoretical guarantees: the estimators are optimal when the correlation structure is correctly specified and remain consistent when it is mis-specified. In simulations and mobile-health analyses, GFQI reduces regret by approximately 50% on average, and by up to 80% under strong correlations. By accounting for cluster structure, GFQI improves sample efficiency for policy learning in healthcare and other domains with clustered data.

Abstract

This paper focuses on reinforcement learning (RL) with clustered data, which is commonly encountered in healthcare applications. We propose a generalized fitted Q-iteration (FQI) algorithm that incorporates generalized estimating equations into policy learning to handle intra-cluster correlations. Theoretically, we demonstrate (i) the optimality of our Q-function and policy estimators when the correlation structure is correctly specified, and (ii) their consistency when the structure is mis-specified. Empirically, through simulations and analyses of a mobile health dataset, we find that the proposed generalized FQI halves the regret, on average, compared to the standard FQI.
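To make the idea of combining generalized estimating equations with FQI concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a scalar state, a linear Q-function with hypothetical per-action features `[1, s]`, and an exchangeable working correlation with a fixed, user-supplied parameter `rho` shared by all observations in a cluster. Each iteration builds the usual FQI targets and then solves the weighted estimating equation $\sum_c \Phi_c^\top V_c^{-1}(y_c - \Phi_c \beta) = 0$ instead of an ordinary least-squares fit. All function names and the simulated data are illustrative assumptions.

```python
import numpy as np


def features(s, a, n_actions=2):
    """Hypothetical linear features: [1, s] placed in the block of action a."""
    phi = np.zeros((s.shape[0], 2 * n_actions))
    rows = np.arange(s.shape[0])
    phi[rows, 2 * a] = 1.0
    phi[rows, 2 * a + 1] = s
    return phi


def exchangeable_inverse(m, rho):
    """Inverse of the m x m working correlation (1 - rho) * I + rho * 1 1^T.

    Requires -1 / (m - 1) < rho < 1 so that the matrix is positive definite.
    """
    v = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
    return np.linalg.inv(v)


def gee_fqi(s, a, r, s_next, cluster, rho, gamma=0.9, n_actions=2, n_iter=50):
    """FQI where each regression step is a GEE-style weighted least squares."""
    phi = features(s, a, n_actions)
    d = phi.shape[1]
    beta = np.zeros(d)
    for _ in range(n_iter):
        # Bellman targets: r + gamma * max_a' Q(s', a'; beta)
        q_next = np.column_stack([
            features(s_next, np.full(s_next.shape[0], b), n_actions) @ beta
            for b in range(n_actions)
        ])
        y = r + gamma * q_next.max(axis=1)
        # Accumulate Phi_c^T V_c^{-1} Phi_c and Phi_c^T V_c^{-1} y_c per cluster
        lhs = np.zeros((d, d))
        rhs = np.zeros(d)
        for c in np.unique(cluster):
            idx = cluster == c
            w = exchangeable_inverse(int(idx.sum()), rho)
            lhs += phi[idx].T @ w @ phi[idx]
            rhs += phi[idx].T @ w @ y[idx]
        # Tiny ridge term only for numerical stability (an assumption of this sketch)
        beta = np.linalg.solve(lhs + 1e-8 * np.eye(d), rhs)
    return beta


if __name__ == "__main__":
    # Simulated clustered data: a shared random effect per cluster enters the reward.
    rng = np.random.default_rng(0)
    cluster = np.repeat(np.arange(20), 25)
    n = cluster.size
    s = rng.normal(size=n)
    a = rng.integers(0, 2, size=n)
    effect = rng.normal(size=20)[cluster]
    r = s * (2 * a - 1) + effect + 0.1 * rng.normal(size=n)
    s_next = 0.5 * s + rng.normal(size=n)
    print("rho = 0.0 (plain FQI):", gee_fqi(s, a, r, s_next, cluster, rho=0.0))
    print("rho = 0.5 (GEE-style):", gee_fqi(s, a, r, s_next, cluster, rho=0.5))
```

Setting `rho = 0` makes every weight matrix the identity, so the sketch reduces to standard FQI with least squares. In practice the correlation parameter would be estimated from residuals rather than fixed by hand; that step is omitted here.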

Paper Structure

This paper contains 17 sections, 2 theorems, 42 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

Suppose the paper's assumptions, from bounded reward through unique optimal policy, are satisfied. For a sufficiently large sample size $N$, the estimator $\widehat{\beta}$ computed by the paper's optimal FQI algorithm has the following properties:

Figures (4)

  • Figure 1: Left panel: Caterpillar plot of random effects for institutions with at least $5$ interns. The error bars indicate 95% confidence intervals. Right panel: Average reward of policies computed by standard FQI (blue) and the proposed generalized FQI (red) as the number of clusters increases. The horizontal line (cyan) depicts the optimal value, computed by an online deep Q-network (DQN) agent trained on a sufficiently large amount of data. Both the number of subjects per cluster and the time horizon are fixed at $5$.
  • Figure 2: Left panel: Visualization of MDPs. $(S_t^{(1)}, A_t^{(1)},R_t^{(1)})_{t\ge 1}$ and $(S_t^{(2)}, A_t^{(2)},R_t^{(2)})_{t\ge 1}$ denote two data trajectories. The Markov assumption precludes paths $S_{t-1}^{(1)}\to S_{t+1}^{(1)}$ and $S_{t-1}^{(2)}\to S_{t+1}^{(2)}$. The independence assumption precludes paths across the two trajectories. Right panel: A clustered MDP example.
  • Figure 3: Regret of the average reward with varying (i) numbers of clusters, (ii) cluster sizes, (iii) numbers of days and (iv) values of the intra-cluster correlation parameter ($\psi$). The shaded band represents the standard error. The green line (AGTD) and blue line (FQI) largely overlap.
  • Figure 4: Change in regret of the average reward with varying cluster size, time horizon or number of clusters. The band represents the standard error calculated with respect to the random seed after running experiments 50 times.

Theorems & Definitions (2)

  • Theorem 1: MSE
  • Theorem 2