Can we use LLMs to bootstrap reinforcement learning? -- A case study in digital health behavior change
Nele Albers, Esra Cemre Su de Groot, Loes Keijsers, Manon H. Hillegers, Emiel Krahmer
TL;DR
This work tackles bootstrapping reinforcement learning for digital health behavior change in the absence of real user data by using out-of-the-box LLMs to generate interaction samples of the form $\langle s_t, a_t, r_t, s_{t+1} \rangle$ (state, action, reward, next state). An offline RL pipeline trained on these LLM-generated samples is evaluated on four health behavior change studies, comparing the resulting policies and learned dynamics to those derived from real data and from human raters. The findings show that LLM-generated samples can yield policies that outperform worst-case and random baselines and, in many cases, approach the performance of human raters, while the dynamics estimates (rewards and transitions) are closer to those from real data than simple baselines are. The study also shows that the effect of prompting strategies (short vs. extensive prompts, few-shot, chain-of-thought) depends strongly on the study and the model: few-shot prompting offers more consistent gains, chain-of-thought prompting shows no uniform benefit, and Llama-3.3-70B frequently performs well overall. The authors provide practical recommendations for integrating LLM-generated samples with real data to bootstrap RL design in digital health applications.
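To make the pipeline concrete, the following is a minimal sketch of how a set of samples, each holding a state, an action, a reward, and a next state, could be turned into estimated dynamics and a policy. It is an illustration under stated assumptions, not the authors' implementation: it assumes small discrete state and action spaces, uses plain count-based estimates and value iteration, and the function names, the discount factor `gamma`, and the toy samples are all hypothetical.

```python
import numpy as np

def estimate_dynamics(samples, n_states, n_actions):
    """Estimate mean rewards R[s, a] and transition probabilities P[s, a, s']."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in samples:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)
    # Assumption: unvisited (s, a) pairs get zero reward and uniform transitions.
    R = np.divide(reward_sum, visits, out=np.zeros_like(reward_sum), where=visits > 0)
    P = np.where(
        visits[:, :, None] > 0,
        counts / np.maximum(visits, 1)[:, :, None],
        1.0 / n_states,
    )
    return R, P

def value_iteration(R, P, gamma=0.85, tol=1e-8):
    """Compute a greedy policy for the estimated model (gamma is an assumed value)."""
    Q = np.zeros_like(R)
    while True:
        V = Q.max(axis=1)            # best value per state
        Q_new = R + gamma * (P @ V)  # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new.argmax(axis=1), Q_new
        Q = Q_new

# Hypothetical toy samples (state, action, reward, next state); in practice these
# would come from LLM responses, human raters, or real users.
samples = [
    (0, 0, 0.0, 0),
    (0, 1, 1.0, 1),
    (1, 0, 1.0, 1),
    (1, 1, 0.0, 0),
]
R, P = estimate_dynamics(samples, n_states=2, n_actions=2)
policy, Q = value_iteration(R, P)
```

Because `R` and `P` are explicit arrays, the same estimates can be computed from LLM-generated and from real samples and then compared directly, alongside the policies derived from them.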
Abstract
Personalizing digital applications for health behavior change is a promising route to making them more engaging and effective. This especially holds for approaches that adapt to users and their specific states (e.g., motivation, knowledge, wants) over time. However, developing such approaches requires making many design choices, whose effectiveness is difficult to predict from the literature and costly to evaluate in practice. In this work, we explore whether large language models (LLMs) can be used out-of-the-box to generate samples of user interactions that provide useful information for training reinforcement learning models for digital behavior change settings. Using real user data from four large behavior change studies as a comparison, we show that LLM-generated samples can be useful in the absence of real data. Comparisons to samples provided by human raters further show that LLM-generated samples reach the performance of human raters. Additional analyses of different prompting strategies, including shorter and longer prompt variants, chain-of-thought prompting, and few-shot prompting, show that the relative effectiveness of these strategies depends on both the study and the LLM, and that even paraphrases of the same prompt can lead to relatively large differences. We provide recommendations for how LLM-generated samples can be useful in practice.
