Off-Policy Selection for Initiating Human-Centric Experimental Design

Ge Gao, Xi Yang, Qitong Gao, Song Ju, Miroslav Pajic, Min Chi

TL;DR

This work introduces First-Glance Off-Policy Selection (FPS), a novel approach that systematically addresses participant heterogeneity by grouping individuals with similar traits into sub-groups and tailoring the OPS criteria to each sub-group.

Abstract

In human-centric tasks such as healthcare and education, the heterogeneity among patients and students necessitates personalized treatments and instructional interventions. While reinforcement learning (RL) has been applied to these tasks, off-policy selection (OPS) is pivotal for closing the loop: it evaluates and selects policies offline, without online interactions. Yet current OPS methods often overlook the heterogeneity among participants. Our work centers on a pivotal challenge in human-centric systems (HCSs): how to select a policy to deploy when a new participant joins the cohort, without access to any prior offline data collected from that participant. We introduce First-Glance Off-Policy Selection (FPS), a novel approach that systematically addresses participant heterogeneity through sub-group segmentation and OPS criteria tailored to each sub-group. By grouping individuals with similar traits, FPS facilitates personalized policy selection aligned with the unique characteristics of each participant or group of participants. FPS is evaluated on two important but challenging applications: intelligent tutoring systems and a healthcare application for sepsis treatment and intervention. FPS demonstrates significant advances in enhancing students' learning outcomes and in-hospital care outcomes.

Paper Structure

This paper contains 40 sections, 1 theorem, 12 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Proposition 2.5

Define the estimator $\hat{D}^{\pi,\beta}_{K_m}$ as, i.e., here, $\mathcal{I}_m$ follows the definition above, which is the set of participants grouped in $K_m$; $\omega_i=\prod_{t=1}^{T} \pi(a^{(i)}_t|s^{(i)}_t) / \beta(a^{(i)}_t|s^{(i)}_t)$ is the IS weight for the $i$-th trajectory in the offline dataset; $s^{(i)}_t, a^{(i)}_t, r^{(i)}_t$ are the states, a with $ESS$ being the effective sample s
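
As a rough illustration of the per-trajectory IS weight $\omega_i$ named in Proposition 2.5, the sketch below multiplies the ratios $\pi(a_t|s_t)/\beta(a_t|s_t)$ along one trajectory. The function names and the list-of-probabilities input format are hypothetical, not taken from the paper. The effective sample size is computed with the common Kish form $(\sum_i \omega_i)^2 / \sum_i \omega_i^2$, which is an assumption here, since the proposition text above does not spell out its $ESS$ definition.

```python
import math
from typing import Sequence


def is_weight(pi_probs: Sequence[float], beta_probs: Sequence[float]) -> float:
    """IS weight of one trajectory: prod_t pi(a_t|s_t) / beta(a_t|s_t).

    pi_probs[t] and beta_probs[t] are the probabilities that the target
    policy pi and the behavioral policy beta assign to the logged action
    at step t (hypothetical input format, for illustration only).
    """
    if len(pi_probs) != len(beta_probs):
        raise ValueError("pi_probs and beta_probs must have equal length")
    log_w = 0.0  # accumulate in log space to limit underflow on long horizons
    for p, b in zip(pi_probs, beta_probs):
        if b <= 0.0:
            raise ValueError("behavioral policy must give the logged action "
                             "positive probability")
        if p == 0.0:
            return 0.0  # target policy never takes this action
        log_w += math.log(p) - math.log(b)
    return math.exp(log_w)


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size (assumed form): (sum w)^2 / sum w^2."""
    total = sum(weights)
    sq_total = sum(w * w for w in weights)
    return 0.0 if sq_total == 0.0 else total * total / sq_total


# Usage: weights for the trajectories of one sub-group K_m (made-up numbers).
group_weights = [
    is_weight([0.5, 0.5], [0.25, 0.5]),  # ratio 2 at step 1, 1 at step 2
    is_weight([0.2, 0.5], [0.4, 0.5]),   # ratio 0.5 at step 1, 1 at step 2
]
group_ess = effective_sample_size(group_weights)
```

Computing the weights and the effective sample size separately for each sub-group, rather than over the whole cohort, mirrors the per-sub-group framing of the proposition; a small value signals that only a few trajectories dominate the estimate for that sub-group.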

Figures (4)

  • Figure 1: Analysis of main results from the real-world IE experiment. (a) Overall performance of the 6th-semester student cohort. Methods that selected the same policy are merged into one bin, i.e., "all" refers to all three variations (raw, +RRS, +VRRS) of the existing OPS baselines. (b) Estimated and true policy performance for each method. For OPE, OPE+RRS, and OPE+VRRS, the figure shows the results with the smallest gap between estimated and true rewards among the OPE methods (i.e., WIS, FQE+RRS, and FQE+VRRS, respectively). True reward refers to the returns averaged over the 6th-semester cohort, obtained by deploying the policy selected for each student.
  • Figure 2: Performance of students across all four sub-groups under the selected policies in the 6th semester.
  • Figure 3: Graphical user interface (GUI) of the IE system. The problem statement window (top) presents the statement of the problem. The dialog window (middle right) shows the messages the tutor provides to the student. Responses, e.g., writing an equation, are entered in the response window (bottom right). Any variables and equations generated through this process are shown in the variable window (middle left) and the equation window (bottom left).
  • Figure 4: Mean absolute error (MAE) of OPE AugRRS with subgroup partitioning over problems in historical data.

Theorems & Definitions (4)

  • Definition 2.4: Value Function per Sub-group
  • Proposition 2.5
  • proof
  • proof