Towards a Personal Health Large Language Model

Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Nicholas A. Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotra, Leor Stern, Yossi Matias, Greg S. Corrado, Shwetak Patel, Shravya Shetty, Jiening Zhan, Shruthi Prabhakara, Daniel McDuff, Cory Y. McLean

TL;DR

This work introduces PH-LLM, a Gemini-based personal health LLM tuned to reason over longitudinal wearable data for sleep and fitness. It builds three datasets to evaluate coaching insights, expert-domain knowledge, and patient-reported outcomes, and demonstrates that PH-LLM can approach expert performance on long-form sleep/fitness case studies and excel on professional exams. The study also shows that multimodal sensor encoding enables PRO prediction on par with discriminative models and introduces AutoEval for scalable model evaluation. Together, these results advance AI-assisted personalized health interactions while outlining safety and contextualization challenges for real-world deployment.

Abstract

In health, most large language model (LLM) research has focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into such tasks, provide rich, longitudinal data for personal health monitoring. Here we present Personal Health Large Language Model (PH-LLM), fine-tuned from Gemini for understanding and reasoning over numerical time-series personal health data. We created and curated three datasets that test 1) production of personalized insights and recommendations from sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep outcomes. For the first task we designed 857 case studies in collaboration with domain experts to assess real-world scenarios in sleep and fitness. Through comprehensive evaluation of domain-specific rubrics, we observed that Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. We evaluated PH-LLM domain knowledge using multiple choice sleep medicine and fitness examinations. PH-LLM achieved 79% on sleep and 88% on fitness, exceeding average scores from a sample of human experts. Finally, we trained PH-LLM to predict self-reported sleep quality outcomes from textual and multimodal encoding representations of wearable data, and demonstrate that multimodal encoding is required to match performance of specialized discriminative models. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.

Paper Structure

This paper contains 42 sections, 19 figures, and 44 tables.

Figures (19)

  • Figure 1: PH-LLM: A Personal Health Large Language Model. (A) We present PH-LLM, a version of Gemini fine-tuned for personal health and wellness. We evaluated PH-LLM on three aspects of personal health: generating personalized insights and recommendations for user goals in the domains of sleep and fitness, assessing levels of expert knowledge from certification-examination-style multiple-choice questions, and predicting patient-reported outcomes in sleep quality from detailed sensor information. (B) Performance of PH-LLM contextualized with expert human responses. Error bars represent 95% confidence intervals. "$\ast$" indicates a statistically significant difference between two response types. "Naive Performance" is that achieved by a random classifier. Human expert performance is not available for patient-reported outcome prediction from sensor features as this is not commonly performed, and no fitness-related outcomes were measured in the study assessing patient-reported outcomes \cite{mcduff2024google}.
  • Figure 2: Sleep case study example: wearable sensor data used as input and corresponding expert analysis and recommendations for improving sleep quality. The experts considered the individual's demographics and wearable sensor data for up to 29 days, including daily metrics of (A) bedtimes and wake times and (B) time spent in various sleep stages and awake. For all daily metrics considered, see Table \ref{table:prompt_sleep_case_studies_sleep_logs}. The experts also analyzed (C) aggregated statistics of various sleep metrics. For a full list of aggregated statistics, see Table \ref{table:prompt_sleep_case_studies_sleep_summary}. The experts composed responses based on the data, including (D) insights about the individual's sleep, potential etiology, and recommendations for improving sleep quality.
  • Figure 3: Fitness case study example: wearable sensor data used as input and corresponding expert analysis and recommendations. The experts considered the individual's demographics and wearable sensor data over a 30-day period, including daily metrics of (A) cardiovascular training load such as training impulse, (B) sleep metrics such as bedtimes and wake times, and (C) health metrics such as resting heart rate, heart rate variability, and respiratory rate. For all daily and aggregated metrics considered, see Tables \ref{table:prompt_fitness_case_studies_daily_activity}-\ref{table:prompt_fitness_case_studies_agg_health_metrics}. The experts composed responses based on the data, including (D) insights about the individual's training load, sleep, and health metrics, along with a workout readiness assessment and fitness recommendations.
  • Figure 4: Case Study Human Evaluation Results. Mean ratings given by experts for the case study subsections across the (A) sleep and (B) fitness domains. "$*$" indicates a statistically significant difference between two response types after multiple hypothesis testing correction.
  • Figure 5: Prediction of Patient-Reported Outcomes by PH-LLM. (A) Correlation between survey responses for questions that measure related but distinct sleep outcomes from the PROMIS Sleep Disturbance and Sleep Impairment surveys. (B) Feature importance for sensor features predicting survey responses in a linear regression model. The top two predictors for each target, ranked by the magnitude of the regression coefficient, are annotated with "*". (C) Area under the receiver operating characteristic curve for PH-LLM, zero-shot, and few-shot prompting approaches when predicting binary outcomes derived from survey responses. Cases where PH-LLM w/ Adapter performs significantly better than both Zero and Few Shot are annotated with "*". (D) Area under the precision-recall curve for PH-LLM, zero-shot, and few-shot prompting approaches when predicting binary outcomes derived from survey responses. Survey response names are mapped to their corresponding questions in Section \ref{sec:pro_surveys}. "SI", sleep impairment.
  • ...and 14 more figures
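
The Figure 5 caption reports area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for binary outcomes, and Figure 1 compares against a random-classifier baseline. The evaluation code is not shown in this summary, so the following is only a minimal sketch of how these two metrics are commonly computed. The function names `auroc` and `average_precision` and the toy labels and scores are hypothetical and do not come from the paper.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive example is scored
    above a random negative one; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of AUPRC: the mean of the precision values at the
    rank of each positive example. Tied scores are ordered arbitrarily here."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy data, invented purely for illustration (not from the paper).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auroc = auroc(toy_labels, toy_scores)               # 0.75
toy_auprc = average_precision(toy_labels, toy_scores)   # about 0.833

# A classifier that gives every example the same score has AUROC 0.5.
constant_auroc = auroc(toy_labels, [0.5, 0.5, 0.5, 0.5])
```

Under this sketch, an uninformative classifier scores 0.5 on AUROC, while its expected AUPRC equals the fraction of positive examples, so the naive reference level for AUPRC depends on the prevalence of each outcome.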