Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data

Yubin Kim; Xuhai Xu; Daniel McDuff; Cynthia Breazeal; Hae Won Park

TL;DR

Health-LLM investigates how large language models can predict health outcomes from wearable sensor time series, both by grounding non-linguistic data in prompts and by targeted fine-tuning. The authors evaluate 12 state-of-the-art LLMs on four public health datasets using zero-shot prompting, few-shot prompting (with chain-of-thought (CoT) and self-consistency (SC)), and instruction tuning with parameter-efficient fine-tuning (PEFT). They introduce HealthAlpaca, a fine-tuned model that performs comparably to much larger models (GPT-3.5, GPT-4 and Gemini-Pro) and achieves the best performance in 8 out of 10 tasks. A key finding is that context-enhanced prompts, especially those with health knowledge context, and selective fine-tuning yield substantial gains, with context enhancement improving performance by up to 23.8%. The work also documents the capabilities and limits of LLMs in health reasoning, highlights cross-dataset generalization patterns, and releases HealthAlpaca as an open baseline for consumer health prediction tasks, informing future development of clinically grounded, privacy-conscious health-AI systems.

Abstract

Large language models (LLMs) are capable of many natural language tasks, yet they are far from perfect. In health applications, grounding and interpreting domain-specific and non-linguistic data is crucial. This paper investigates the capacity of LLMs to make inferences about health based on contextual information (e.g., user demographics, health knowledge) and physiological data (e.g., resting heart rate, sleep minutes). We present a comprehensive evaluation of 12 state-of-the-art LLMs with prompting and fine-tuning techniques on four public health datasets (PMData, LifeSnaps, GLOBEM and AW_FB). Our experiments cover 10 consumer health prediction tasks in mental health, activity, metabolic, and sleep assessment. Our fine-tuned model, HealthAlpaca, exhibits comparable performance to much larger models (GPT-3.5, GPT-4 and Gemini-Pro), achieving the best performance in 8 out of 10 tasks. Ablation studies highlight the effectiveness of context enhancement strategies. Notably, we observe that our context enhancement can yield up to 23.8% improvement in performance. While constructing contextually rich prompts (combining user context, health knowledge and temporal information) exhibits synergistic improvement, the inclusion of health knowledge context in prompts significantly enhances overall performance.
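The context enhancement described in the abstract, combining user context, health knowledge and temporal information with physiological data in a single prompt, can be illustrated with a minimal sketch. This is not the authors' prompt template: the `PromptContext` class, the `build_prompt` function, the field labels, and all sample values below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PromptContext:
    """Optional context blocks (hypothetical names, not from the paper)."""
    user_context: str = ""      # e.g., demographics
    health_knowledge: str = ""  # e.g., a short domain note
    temporal_context: str = ""  # e.g., how the time series is laid out


def build_prompt(task: str, sensor_data: dict, ctx: PromptContext) -> str:
    """Assemble a text prompt: context blocks, then sensor series, then the question."""
    parts = []
    # Each context block is added only when it is non-empty, so the same
    # function yields a basic prompt (no context) or an enhanced one.
    if ctx.user_context:
        parts.append(f"User context: {ctx.user_context}")
    if ctx.health_knowledge:
        parts.append(f"Health knowledge: {ctx.health_knowledge}")
    if ctx.temporal_context:
        parts.append(f"Temporal context: {ctx.temporal_context}")
    # Serialize each numeric series as plain text so a language model can read it.
    for name, values in sensor_data.items():
        series = ", ".join(f"{v:g}" for v in values)
        parts.append(f"{name}: [{series}]")
    parts.append(f"Question: {task}")
    return "\n".join(parts)


# Illustrative values only; they are not taken from any of the four datasets.
data = {"resting_heart_rate": [62.0, 64.5, 61.0], "sleep_minutes": [410.0, 385.0, 440.0]}
basic_prompt = build_prompt("Estimate the user's readiness score.", data, PromptContext())
enhanced_prompt = build_prompt(
    "Estimate the user's readiness score.",
    data,
    PromptContext(
        user_context="illustrative demographics go here",
        health_knowledge="illustrative domain note goes here",
        temporal_context="values are daily readings, oldest first",
    ),
)
print(enhanced_prompt)
```

Keeping each context block optional makes it straightforward to run an ablation in the spirit of the one the abstract mentions, by toggling one block at a time and comparing model outputs against the basic prompt.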

Paper Structure (33 sections, 1 equation, 9 figures, 16 tables)

Introduction
Related Work
Wearable Sensor Data with LLMs
Health LLMs
Methods
Zero-shot Prompting
Few-shot Prompting
Instruction Tuning
Temporal Encoding Methods
Experiment
Datasets and Tasks
PMData
LifeSnaps
GLOBEM
AW_FB
...and 18 more sections

Figures (9)

Figure 1: Health-LLM. We present a framework for evaluating LLM performance on a diverse set of health prediction tasks, training and prompting the models with multi-modal health data.
Figure 2: (a): Average Performance Improvement over basic (bs) across contexts. (b): Best Performance Improvement across LLMs. (c): Best Performance Improvement across Datasets. Note that a few models (Llama 2, Gemini-Pro, BioMedGPT and BioMistral) were excluded from this experiment because models were prioritized based on integration timelines.
Figure 3: A Case Study on Readiness Score Prediction (READ) from the PMData dataset. Here, we display the responses from 1) our fine-tuned model, HealthAlpaca, 2) GPT-3.5, 3) GPT-4 and 4) Gemini-Pro. Green bold text highlights valid reasoning, and red bold text highlights reasoning that is false or irrelevant to the input.
Figure 4: A Case Study on Sleep Disorder Prediction (SQ) from the LifeSnaps dataset. Here, we display the responses from 1) our fine-tuned model, HealthAlpaca, 2) GPT-3.5, 3) GPT-4 and 4) Gemini-Pro. Green bold text highlights valid reasoning.
Figure 5: Health Prediction Performance of Fully Fine-tuned MedAlpaca with Different Training Sizes. Instruction fine-tuning is conducted on ten tasks across four datasets. The solid lines represent the fully fine-tuned model's performance, whereas the dashed lines represent the zero-shot performance of MedAlpaca, which serves as the baseline. Note that the color indicates the metric used to evaluate each task.
...and 4 more figures
