Flexible Blood Glucose Control: Offline Reinforcement Learning from Human Feedback
Harry Emerson, Sam Gordon James, Matthew Guy, Ryan McConville
TL;DR
PAINT addresses the lack of patient-guided customisation in RL-based glucose control by learning a patient-specific reward function through reward sketching and applying a safety-constrained offline RL algorithm derived from TD3+BC. The method uses a verifiably safe prior policy and a tunable preference strength parameter $\lambda$ to balance safety against personalisation, and is demonstrated in the UVA/Padova T1D simulator with three virtual patients. In silico results show a 15% reduction in glycaemic risk relative to a commercial benchmark and, when patient guidance is provided, better handling of meals (+10% time-in-range post-meal) and of certain device errors (-1.6% variance post-error). Performance remains robust under limited data, labelling noise, and intra-patient variability. PAINT is a first step toward real-world, patient-customisable RL glucose controllers and could generalise to other safety-critical domains that require rapid, user-guided preference learning under constraints.
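To illustrate how a preference strength parameter can trade value maximisation against staying close to a safe prior, the following Python sketch computes an actor loss in the style of TD3+BC. It is not PAINT's actual objective, which is not stated here. The function name `td3_bc_actor_loss`, the use of the safe prior's actions as the behaviour-cloning target, and the unnormalised weight `lam` are all assumptions made for illustration.

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, reference_actions, lam):
    """TD3+BC-style actor loss (illustrative, hypothetical helper).

    loss = -lam * mean(Q(s, pi(s))) + mean((pi(s) - a_ref)^2)

    Assumption: `reference_actions` come from a safe prior policy, so that
    lam = 0 reduces the loss to pure imitation of that prior.
    Note: the original TD3+BC also scales the weight by the mean absolute
    Q-value; that normalisation is omitted here for simplicity.
    """
    q_values = np.asarray(q_values, dtype=float)
    policy_actions = np.asarray(policy_actions, dtype=float)
    reference_actions = np.asarray(reference_actions, dtype=float)

    # Penalty for deviating from the reference (safe) actions.
    bc_penalty = np.mean((policy_actions - reference_actions) ** 2)

    # Larger lam puts more weight on the learned value estimate.
    return float(-lam * np.mean(q_values) + bc_penalty)


# Toy inputs: three Q-value estimates and two action pairs.
q = [1.0, 2.0, 3.0]
pi_a = [0.5, 1.0]
ref_a = [0.5, 1.5]

loss_safe = td3_bc_actor_loss(q, pi_a, ref_a, lam=0.0)   # imitation only
loss_pref = td3_bc_actor_loss(q, pi_a, ref_a, lam=0.5)   # value term added
```

In this sketch, lowering `lam` toward zero pulls the policy toward the reference actions, while raising it lets the learned value estimate dominate; this mirrors the sliding-scale idea only loosely.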
Abstract
Reinforcement learning (RL) has demonstrated success in automating insulin dosing in simulated type 1 diabetes (T1D) patients but is currently unable to incorporate patient expertise and preference. This work introduces PAINT (Preference Adaptation for INsulin control in T1D), an original RL framework for learning flexible insulin dosing policies from patient records. PAINT employs a sketch-based approach to reward learning, in which past data are annotated with a continuous reward signal that reflects the patient's desired outcomes. The labelled data are used to train a reward model, which informs the actions of a novel safety-constrained offline RL algorithm designed to restrict actions to a safe strategy and to enable preference tuning via a sliding scale. In silico evaluation shows that PAINT achieves common glucose goals through simple labelling of desired states, reducing glycaemic risk by 15% relative to a commercial benchmark. Action labelling can also be used to incorporate patient expertise: with patient guidance, PAINT is able to pre-empt meals (+10% time-in-range post-meal) and address certain device errors (-1.6% variance post-error). These results hold under realistic conditions, including limited samples, labelling errors, and intra-patient variability. This work illustrates PAINT's potential in real-world T1D management and, more broadly, in any task requiring rapid and precise preference learning under safety constraints.
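The reward-sketching step and the time-in-range metric mentioned above can be made concrete with a small Python sketch. It is only a stand-in: a ridge-regularised linear model replaces whatever reward model PAINT actually uses, and the helper names (`fit_linear_reward`, `predict_reward`, `time_in_range`), the toy features, and the 70 to 180 mg/dL target range are assumptions chosen for illustration rather than details taken from this work.

```python
import numpy as np

def _with_bias(features):
    """Append a constant column so the model can learn an offset."""
    X = np.asarray(features, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_linear_reward(features, sketched_rewards, ridge=1e-3):
    """Fit a linear reward model to continuous reward annotations.

    Assumption: each row of `features` describes one annotated time step
    and `sketched_rewards` holds the patient's sketched reward for it.
    """
    X = _with_bias(features)
    y = np.asarray(sketched_rewards, dtype=float)
    # Ridge-regularised normal equations: (X^T X + ridge * I) w = X^T y
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def predict_reward(weights, features):
    """Predict rewards for unlabelled time steps."""
    return _with_bias(features) @ weights

def time_in_range(glucose_mg_dl, low=70.0, high=180.0):
    """Fraction of glucose readings within [low, high] mg/dL (assumed range)."""
    g = np.asarray(glucose_mg_dl, dtype=float)
    return float(np.mean((g >= low) & (g <= high)))


# Toy annotations: reward grows linearly with a single hypothetical feature.
feats = [[0.0], [1.0], [2.0], [3.0]]
sketch = [0.0, 1.0, 2.0, 3.0]
w = fit_linear_reward(feats, sketch)
pred = predict_reward(w, feats)

# Two of these four readings fall inside the assumed target range.
tir = time_in_range([60.0, 100.0, 150.0, 200.0])
```

In a pipeline of this kind, the fitted model would label the remaining unlabelled records, and those predicted rewards would then feed the offline RL step.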
