Can we use LLMs to bootstrap reinforcement learning? -- A case study in digital health behavior change

Nele Albers, Esra Cemre Su de Groot, Loes Keijsers, Manon H. Hillegers, Emiel Krahmer

TL;DR

This work tackles bootstrapping reinforcement learning (RL) for digital health behavior change in the absence of real user data by using out-of-the-box LLMs to generate interaction samples of the form $\langle s_t, r_t, a_t, s_{t+1} \rangle$. An offline RL pipeline trained on these LLM-generated samples is evaluated on four health behavior change studies, comparing the resulting policies and learned dynamics to those derived from real data and from human raters. The findings show that LLM-generated samples can yield policies that beat the worst and random baselines and, in many cases, approach the performance of human raters, while the estimated dynamics (rewards and transitions) are closer to those from real data than simple baselines are. The effect of prompting strategies (short vs. extensive prompts, few-shot, chain-of-thought) depends strongly on the study and the model: few-shot prompting offers more consistent gains, chain-of-thought prompting shows no uniform benefit, and Llama-3.3-70B frequently performs well. The authors provide practical recommendations for integrating LLM-generated samples with real data to bootstrap RL design in digital health applications.

Abstract

Personalizing digital applications for health behavior change is a promising route to making them more engaging and effective. This especially holds for approaches that adapt to users and their specific states (e.g., motivation, knowledge, wants) over time. However, developing such approaches requires making many design choices, whose effectiveness is difficult to predict from the literature and costly to evaluate in practice. In this work, we explore whether large language models (LLMs) can be used out-of-the-box to generate samples of user interactions that provide useful information for training reinforcement learning models for digital behavior change settings. Using real user data from four large behavior change studies as a comparison, we show that LLM-generated samples can be useful in the absence of real data. Comparisons to samples provided by human raters further show that LLM-generated samples reach the performance of human raters. Additional analyses of different prompting strategies, including shorter and longer prompt variants, chain-of-thought prompting, and few-shot prompting, show that the relative effectiveness of the strategies depends on both the study and the LLM, and that even paraphrases of the same prompt lead to relatively large differences. We provide recommendations for how LLM-generated samples can be useful in practice.
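
As an illustration of how interaction samples could feed an offline RL pipeline, the following is a minimal sketch, not the authors' implementation. It assumes a small discrete state and action space, estimates rewards and transitions by averaging and counting, and derives a policy with value iteration. The names `Sample`, `estimate_dynamics`, and `greedy_policy`, the discount factor of 0.85, and the uniform fallback for unseen state-action pairs are all hypothetical choices made for this example.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """One interaction sample; named fields avoid relying on tuple order."""
    state: int
    action: int
    reward: float
    next_state: int


def estimate_dynamics(samples, n_states, n_actions):
    """Estimate tabular rewards R[s][a] and transitions P[s][a][s'] from samples."""
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    next_counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for x in samples:
        reward_sum[x.state][x.action] += x.reward
        visits[x.state][x.action] += 1
        next_counts[x.state][x.action][x.next_state] += 1

    # Assumption: unseen (state, action) pairs get reward 0 and uniform transitions.
    rewards = [[0.0] * n_actions for _ in range(n_states)]
    transitions = [[[1.0 / n_states] * n_states for _ in range(n_actions)]
                   for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            n = visits[s][a]
            if n > 0:
                rewards[s][a] = reward_sum[s][a] / n
                transitions[s][a] = [c / n for c in next_counts[s][a]]
    return rewards, transitions


def greedy_policy(rewards, transitions, gamma=0.85, n_iter=200):
    """Run value iteration on the estimated model and return the greedy action per state."""
    n_states, n_actions = len(rewards), len(rewards[0])
    values = [0.0] * n_states

    def q(s, a):
        # Expected return of taking action a in state s under the current values.
        return rewards[s][a] + gamma * sum(
            p * v for p, v in zip(transitions[s][a], values))

    for _ in range(n_iter):
        values = [max(q(s, a) for a in range(n_actions)) for s in range(n_states)]
    return [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]
```

In this sketch, the same two functions can be applied to LLM-generated, human-generated, or real behavioral samples, so the resulting policies can be compared on equal footing.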

Paper Structure

This paper contains 10 sections, 24 figures, 3 tables.

Figures (24)

  • Figure 1: Illustration of the task of predicting next states for the behavior change study on deciding whether to add human feedback when preparing for quitting smoking. After receiving general information on the task and the behavior change study, human raters were given scenarios describing an imaginary user's state (described by three variables illustrated with a bar chart) and a chosen action (receiving human feedback), for which they had to predict the next state. The information given to human raters is analogous to that given to the LLMs.
  • Figure 2: Overview of our offline RL approach applied to four different behavior change studies. Data collection: In the absence of real behavioral data, data is collected by prompting an LLM to imagine a user in state $s_t$ receiving action $a_t$. Training: The data is used to train an RL agent. Deployment: The trained agent is tested in a simulation of real users created using previously collected real behavioral data.
  • Figure 3: Example prompt for the reward for study 3 on human feedback for quitting smoking. The part in italics is only shown for the extensive prompt version.
  • Figure 4: Simulated performance of policies learned from samples generated by different LLMs compared to policies learned from the real behavioral samples as well as human-generated samples. For each of the four behavior change studies described in Table \ref{tab:overview_studies}, the policies are evaluated on the study-specific evaluation criterion (y-axis) over time (x-axis). The optimal policy $\pi^*$, the no-learned-dynamics policy, and the worst policy $\pi^-$ are learned based on the real behavioral data. The human policy $\pi^H$ is learned based on the human-generated samples.
  • Figure 5: Mean $L_1$-error and 95% credible interval between the rewards estimated from samples generated by different LLMs and those estimated from the real behavioral samples for different numbers of samples per action. As comparisons, we use the assumption that people spend the mean reward for each state-action combination (Mean reward) and samples drawn from the real behavioral samples (Oracle). Moreover, we computed the reward function from the human-generated samples (Human). Means are shown over the 10 different prompt variants for LLM-generated samples and over 10 random draws for the real behavioral samples.
  • ...and 19 more figures
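
The mean $L_1$-error described in the caption of Figure 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the $L_1$-error is taken here as the unnormalized sum of absolute differences over all state-action combinations, and the function names `l1_error` and `mean_l1_error` are hypothetical.

```python
def l1_error(r_est, r_ref):
    """Sum of absolute differences between two reward tables R[s][a].

    Assumption: the error is summed over all state-action pairs without
    normalization; the source does not state the exact definition.
    """
    return sum(
        abs(est - ref)
        for row_est, row_ref in zip(r_est, r_ref)
        for est, ref in zip(row_est, row_ref)
    )


def mean_l1_error(reward_tables, r_ref):
    """Average the L1-error over several estimates, e.g. one per prompt variant."""
    errors = [l1_error(r_est, r_ref) for r_est in reward_tables]
    return sum(errors) / len(errors)
```

Here `r_ref` would be the reward table estimated from the real behavioral samples, and `reward_tables` would hold one estimated table per prompt variant (or per random draw for the Oracle comparison).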