
Combining LLM decision and RL action selection to improve RL policy for adaptive interventions

Karine Karine, Benjamin M. Marlin

TL;DR

The paper aims to accelerate personalization in adaptive health interventions by combining LLM-based processing of user preferences with Bayesian RL (Thompson Sampling, TS). It introduces a hybrid method, LLM+TS, in which an LLM filters the RL candidate action $a$ to produce a hybrid action $\tilde{a}\in\{0,a\}$, so that text-based preferences are incorporated immediately. A new simulation environment, StepCountJITAI for LLM, generates text-based preferences via a binary hidden state $W\in\{0,1\}$ and models constraints that shape behavior, allowing offline evaluation of LLM-enabled policies. Empirical results show that LLM+TS outperforms standard TS across multiple scenarios and LLM implementations, supporting its potential for real-time, user-aligned personalization in adaptive health interventions.

Abstract

Reinforcement learning (RL) is increasingly being used in the healthcare domain, particularly for the development of personalized health adaptive interventions. Inspired by the success of Large Language Models (LLMs), we are interested in using LLMs to update the RL policy in real time, with the goal of accelerating personalization. We use the text-based user preference to influence the action selection on the fly, in order to immediately incorporate the user preference. We use the term "user preference" broadly to refer to a user's personal preference, constraint, health status, or a statement expressing like or dislike. Our novel approach is a hybrid method that combines the LLM response and the RL action selection to improve the RL policy. Given an LLM prompt that incorporates the user preference, the LLM acts as a filter in the typical RL action selection. We investigate different prompting strategies and action selection strategies. To evaluate our approach, we implement a simulation environment that generates the text-based user preferences and models the constraints that impact behavioral dynamics. We show that our approach is able to take into account the text-based user preferences while improving the RL policy, thus improving personalization in adaptive interventions.
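
The filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Beta-Bernoulli form of Thompson Sampling, the function names, and the keyword rule standing in for the LLM call are all assumptions made for illustration. Only the output convention comes from the text: the hybrid action is either 0 ("not send") or the RL candidate action $a$.

```python
import random


def thompson_sample(successes, failures, rng):
    """Pick the action whose Beta posterior draw is largest.

    Assumption: a simple Beta-Bernoulli bandit, used here only as a
    stand-in for the paper's Thompson Sampling agent.
    """
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def llm_says_send(preference_text):
    """Hypothetical stand-in for the LLM decision ("send" / "not send").

    A real system would build a prompt and query an LLM; this keyword
    rule only makes the sketch runnable.
    """
    return "cannot walk" not in preference_text.lower()


def hybrid_action(candidate_action, preference_text):
    """Filter the RL candidate a into a_tilde in {0, a}."""
    return candidate_action if llm_says_send(preference_text) else 0


rng = random.Random(0)
# Hypothetical per-action success/failure counts for 4 actions (0 = no message).
a = thompson_sample([1, 5, 3, 2], [4, 2, 3, 3], rng)
a_tilde = hybrid_action(a, "I hurt my ankle and cannot walk today.")
```

In this sketch the LLM never proposes an action of its own; it can only veto the candidate, which is what keeps the hybrid action inside $\{0,a\}$.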

Paper Structure (14 sections, 4 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of LLM+TS method. LLM+TS is a hybrid method that combines LLM decision and RL action selection to improve the RL policy. The LLM prompt includes information such as a description of the behavioral dynamics, current user state and some past data, user preference (constraint) and a question asking the LLM to decide: "not send" or "send" a message (i.e., $\tilde{a}=0$ or $\tilde{a}=a$). The LLM acts as a filter in the typical RL action selection.
  • Figure 2: Markov chain sketch.
  • Figure 3: LLM+TS vs. standard TS. Example scenarios showing that LLM+TS outperforms standard TS in most settings.
  • Figure 4: LLM+TS vs. standard TS. Example histogram of all selected actions and plot of the cumulative rewards for $(p_{w_{11}}, p_{w_{00}}) = (0.7, 0.1)$. The histograms show that LLM+TS selects action $0$ more often, which indicates that the LLM has correctly decided not to send a message when the user cannot walk. The cumulative reward plots show that LLM+TS outperforms standard TS.
  • Figure 5: LLM+TS vs. standard TS. Example histograms of all selected actions and plots of the cumulative rewards for fixed $p_{w_{11}}=0.7$ and varying $p_{w_{00}}$, when using LLM+TS (blue) and standard TS (gray). The histograms show that LLM+TS selects action $0$ more often, which indicates that the LLM has correctly decided not to send a message when the user cannot walk. The cumulative reward plots show that LLM+TS outperforms standard TS.
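
The binary hidden state $W$ and the pair $(p_{w_{11}}, p_{w_{00}})$ in the captions above suggest a two-state Markov chain. The Python sketch below simulates such a chain under a stated assumption: $p_{w_{11}}$ and $p_{w_{00}}$ are read as the probabilities of staying in state 1 and state 0 respectively. The captions do not confirm this reading, and the function names are hypothetical.

```python
import random


def simulate_w(p_w11, p_w00, steps, w0=0, seed=0):
    """Simulate a two-state Markov chain over W in {0, 1}.

    Assumption: p_w11 = P(W'=1 | W=1) and p_w00 = P(W'=0 | W=0).
    """
    rng = random.Random(seed)
    w, path = w0, []
    for _ in range(steps):
        path.append(w)
        stay = p_w11 if w == 1 else p_w00
        if rng.random() >= stay:  # leave the current state
            w = 1 - w
    return path


def stationary_prob_w1(p_w11, p_w00):
    """Long-run fraction of time spent in W=1 under the same assumption.

    Undefined when both stay probabilities equal 1 (the chain never moves).
    """
    leave0 = 1.0 - p_w00
    leave1 = 1.0 - p_w11
    return leave0 / (leave0 + leave1)


path = simulate_w(0.7, 0.1, steps=1000)
long_run = stationary_prob_w1(0.7, 0.1)  # 0.9 / (0.9 + 0.3)
```

Under this reading, the setting $(0.7, 0.1)$ from Figure 4 keeps the chain in state 1 about three quarters of the time in the long run.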