Flexible Blood Glucose Control: Offline Reinforcement Learning from Human Feedback

Harry Emerson, Sam Gordon James, Matthew Guy, Ryan McConville

TL;DR

PAINT tackles the lack of patient-guided customization in RL-based glucose control by learning a patient-specific reward function through reward sketching and applying a safety-constrained offline RL algorithm derived from TD3+BC. The method leverages a verifiably safe prior policy and a tunable preference strength parameter $\lambda$ to balance safety and personalization, demonstrated in the UVA/Padova T1D simulator with three virtual patients. In silico results show a 15% reduction in glycemic risk and improvements in meal and device-error handling when patient preferences are provided, with robust performance under limited data, labeling noise, and intra-patient variability. PAINT represents a first step toward real-world, patient-customizable RL glucose controllers and could generalize to other safety-critical domains requiring rapid, user-guided preference learning under constraints.

Abstract

Reinforcement learning (RL) has demonstrated success in automating insulin dosing in simulated type 1 diabetes (T1D) patients but is currently unable to incorporate patient expertise and preference. This work introduces PAINT (Preference Adaptation for INsulin control in T1D), an original RL framework for learning flexible insulin dosing policies from patient records. PAINT employs a sketch-based approach for reward learning, where past data is annotated with a continuous reward signal to reflect the patient's desired outcomes. The labelled data trains a reward model, which informs the actions of a novel safety-constrained offline RL algorithm designed to restrict actions to a safe strategy and enable preference tuning via a sliding scale. In-silico evaluation shows PAINT achieves common glucose goals through simple labelling of desired states, reducing glycaemic risk by 15% over a commercial benchmark. Action labelling can also be used to incorporate patient expertise, demonstrating an ability to pre-empt meals (+10% time-in-range post-meal) and address certain device errors (-1.6% variance post-error) with patient guidance. These results hold under realistic conditions, including limited samples, labelling errors, and intra-patient variability. This work illustrates PAINT's potential in real-world T1D management and, more broadly, in any task requiring rapid and precise preference learning under safety constraints.
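
The reward-learning step described above (a labelled subset trains a reward model, which then labels the remaining data) can be illustrated with a minimal sketch. This is not the paper's implementation: the linear ridge model, the two features, the 120 mg/dL target, and the synthetic data are all assumptions chosen only to keep the example short and self-contained.

```python
import numpy as np

def add_bias(features):
    """Append a constant column so the linear model can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_reward_model(features, sketched_rewards, ridge=1e-3):
    """Ridge-regularised least squares; a simple stand-in for a reward model."""
    X = add_bias(features)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ sketched_rewards)

def predict_reward(weights, features):
    """Label arbitrary samples with the fitted reward model."""
    return add_bias(features) @ weights

rng = np.random.default_rng(0)

# Synthetic stand-in for a patient's history (not real data).
glucose = rng.uniform(40.0, 300.0, size=2000)  # mg/dL
basal = rng.uniform(0.0, 2.0, size=2000)       # hypothetical basal dose, U/h

# Hypothetical features: distance from an assumed 120 mg/dL target, and basal dose.
features = np.column_stack([np.abs(glucose - 120.0) / 100.0, basal])

# Pretend the patient sketched rewards for the first 500 samples only,
# preferring glucose near the target, with some labelling noise.
n_labelled = 500
sketched = -features[:n_labelled, 0] + rng.normal(0.0, 0.05, size=n_labelled)

# Train on the labelled subset, then relabel the full dataset.
weights = fit_reward_model(features[:n_labelled], sketched)
relabelled = predict_reward(weights, features)
```

In this toy setting the fitted weight on the distance feature should come out close to -1, mirroring the preference the synthetic labels encode. A real reward model would likely need a richer function class and features than this linear stand-in.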

Paper Structure

This paper contains 32 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of reward sketching for T1D. A blood glucose profile (top) with meal insulin doses and consumed carbohydrates. Recorded basal insulin doses (middle) controlled by the insulin dosing device. Participant-supplied reward labelling (bottom), specifying how close the dosing behaviour is to the desired goal at a given time. The participant draws a continuous reward signal under the historical data, highlighting desirable actions and states.
  • Figure 2: Full training pipeline for the PAINT controller, showing the preference labelling procedure (top) and the offline RL training procedure (bottom). Users label a subset of their historical data, $D_{\text{pref}}$, in order to adapt the insulin dosing strategy to their individual needs. Reward labels, $r_{\text{pref}}$, are used to train a reward model, $r_\psi$, which is then used to label the full patient training dataset, $D'$. A generic policy, $\pi_{\text{priori}}$, is trained using a verifiably safe reward function. $\pi_{\text{priori}}$ is then tuned using $D'$ to incorporate the user's preferences. The strength of the preference effect can be controlled via $\lambda$.
  • Figure 3: Insulin dosing behaviour with and without human feedback: a) incentivising pre-emptive insulin dosing prior to regular mealtimes and b) penalising erroneous drops in insulin during compression lows. Human feedback results in a 10% increase in TIR after meal consumption and a 30% increase in insulin dose during compression lows (more closely matching optimal behaviour). Error bars represent the standard error.
  • Figure 4: Robustness of PAINT to real-world challenges compared across three common T1D goals. PAINT is shown to be surprisingly effective under real-world constraints: it achieves competitive results with $<$1,000 labelled samples (approximately 2 days of data), maintains a performant agent with 80% corrupted training data, and shows only marginal performance reductions with reward labelling noise, even up to 10$\times$ standard deviation, $\sigma$. The dotted lines act as a benchmark and indicate the parameter value without human feedback. Error bars describe the standard error.
  • Figure 5: Nine different reward labelling strategies: three each for improving TIR, reducing TBR, and minimising CoV. $g_t$, $a_t$, and $\Delta g_t = \left(g_t - g_{t-30}\right)$ are the blood glucose, basal action, and successive difference in blood glucose, respectively. The functions are described mathematically in the Appendix.
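
The three goals named in the captions above (TIR, TBR, CoV) and the quantity $\Delta g_t = g_t - g_{t-30}$ can be computed with a short sketch like the one below. The 70 to 180 mg/dL target range and the 5-minute sampling interval are assumptions taken from common clinical and CGM practice, not values stated in this section, and `sketch_in_range_label` is a hypothetical labelling rule rather than one of the nine strategies in Figure 5.

```python
import numpy as np

def glucose_metrics(glucose, low=70.0, high=180.0):
    """Return TIR, TBR (both % of samples) and CoV (%) for a glucose trace in mg/dL."""
    g = np.asarray(glucose, dtype=float)
    tir = 100.0 * np.mean((g >= low) & (g <= high))  # time in range
    tbr = 100.0 * np.mean(g < low)                   # time below range
    cov = 100.0 * np.std(g) / np.mean(g)             # coefficient of variation
    return {"TIR": tir, "TBR": tbr, "CoV": cov}

def glucose_delta(glucose, sample_minutes=5, lag_minutes=30):
    """Compute g_t - g_{t-lag}; entries without enough history are NaN."""
    g = np.asarray(glucose, dtype=float)
    lag = lag_minutes // sample_minutes
    if lag < 1:
        raise ValueError("lag_minutes must be at least sample_minutes")
    delta = np.full_like(g, np.nan)
    delta[lag:] = g[lag:] - g[:-lag]
    return delta

def sketch_in_range_label(glucose, low=70.0, high=180.0):
    """Hypothetical reward label: 1.0 when glucose is in range, else 0.0."""
    g = np.asarray(glucose, dtype=float)
    return ((g >= low) & (g <= high)).astype(float)

# Tiny illustrative trace sampled every 10 minutes (made-up values).
trace = [60.0, 100.0, 150.0, 200.0]
metrics = glucose_metrics(trace)
delta = glucose_delta(trace, sample_minutes=10, lag_minutes=30)
labels = sketch_in_range_label(trace)
```

For the made-up trace, two of the four samples fall in range and one falls below it, so TIR is 50% and TBR is 25%. Only the last sample has 30 minutes of history, so only its difference is defined.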