Large Language Models are Few-Shot Health Learners

Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, Shwetak Patel

TL;DR

This work demonstrates that large language models can act as universal few-shot health learners when grounded with numerical time-series data from wearables and clinical sensors. By embedding physiologic data into textual prompts and tuning a soft prompt, a 24B PaLM model achieves substantial gains over zero-shot and supervised baselines across cardiovascular, activity, metabolic, and mental-health tasks. The study highlights the importance of context-rich prompts for enabling domain knowledge to inform health inferences and reveals limitations related to long time-series inputs and arithmetic challenges. Together, these findings suggest a promising direction for integrating LLMs with quantitative health data to support personalized monitoring and health analytics, while underscoring the need for careful evaluation and safety considerations.

Abstract

Large language models (LLMs) can capture rich representations of concepts that are useful for real-world tasks. However, language alone is limited. While existing LLMs excel at text-based inferences, health applications require that models be grounded in numerical data (e.g., vital signs, laboratory values in clinical domains; steps, movement in the wellness domain) that is not easily or readily expressed as text in existing training corpora. We demonstrate that with only few-shot tuning, a large language model is capable of grounding various physiological and behavioral time-series data and making meaningful inferences on numerous health tasks for both clinical and wellness contexts. Using data from wearable and medical sensor recordings, we evaluate these capabilities on the tasks of cardiac signal analysis, physical activity recognition, metabolic calculation (e.g., calories burned), and estimation of stress reports and mental health screeners.

Paper Structure (13 sections, 3 figures, 5 tables)

This paper contains 13 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Examples of question-answer pairs for our health tasks. In the prompts, data were represented numerically rather than graphically.
  • Figure 2: Our dataset construction and training configurations. We use physiological data to construct question-answer pairs. We create training, validation and test splits and compare zero-shot prediction performance to models created via prompt engineering and prompt tuning.
  • Figure 3: Results of 3-, 10- and 25-shot experiments for nine health tasks. Prompts include inter-beat interval (IBI) sequences (in milliseconds), and the model is asked to provide instantaneous heart rate (A), average heart rate in beats per minute (B), presence or absence of atrial fibrillation (D), presence or absence of slowed heart rate/bradycardia (E), and elevated heart rate/tachycardia (F). Accelerometer data are included in the prompt and the model is asked to classify walking or running (G); Fitbit data (e.g., steps, sleep hours) are included in the prompt and the model is asked to classify stress (H) and PHQ score (I). Data on exercise type, duration and weight are included in the prompt to estimate calories burned (C). The LLM with context-inclusive prompts outperforms the supervised baseline by up to 75%.
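
The heart-rate tasks in Figure 3 (panels A and B) rest on a simple conversion: an inter-beat interval of `ibi` milliseconds corresponds to a heart rate of 60000 / `ibi` beats per minute. The following is a minimal sketch of how an IBI sequence might be serialized as numbers in a text prompt and how reference answers could be computed. The function names and the prompt template are hypothetical illustrations, not the wording or code used in the paper.

```python
from statistics import mean

MS_PER_MINUTE = 60000.0


def instantaneous_hr(ibi_ms):
    """Convert each inter-beat interval (ms) to an instantaneous heart rate (bpm)."""
    return [MS_PER_MINUTE / ibi for ibi in ibi_ms]


def average_hr(ibi_ms):
    """Average heart rate (bpm) over the window, computed from the mean IBI."""
    return MS_PER_MINUTE / mean(ibi_ms)


def build_prompt(ibi_ms, question):
    """Serialize the IBI values as plain numbers inside a text prompt.

    The template below is an assumption made for illustration only.
    """
    series = ", ".join(str(v) for v in ibi_ms)
    return f"IBI sequence (ms): {series}\nQ: {question}\nA:"


if __name__ == "__main__":
    ibis = [800, 820, 780, 800]  # made-up example values, not data from the paper
    print(build_prompt(ibis, "What is the average heart rate in beats per minute?"))
    print(average_hr(ibis))
```

A pair like the prompt string and the computed average could serve as one question-answer example of the kind shown in Figure 1; how the paper actually formats its examples is not specified in this summary.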