A Principle-based Framework for the Development and Evaluation of Large Language Models for Health and Wellness

Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, Alex Lin, Erik Schenck, Shiva Rajagopal, Jia-Ru Chung, Anusha Venkatakrishnan, Amy Armento Lee, Maryam Karimzadehgan, Qingyou Meng, Rythm Agarwal, Aravind Natarajan, Tracy Giest

TL;DR

This work introduces the principle-based SHARP framework (Safety, Helpfulness, Accuracy, Relevance, Personalization) for evaluating large language models in personal health and wellness, and applies it to the Fitbit Insights explorer. It combines human, autorater, and adversarial evaluations in a real-world, staged deployment with over 13,000 consented users to identify risks and guide targeted improvements. The study demonstrates how a structured, end-to-end evaluation process (integrating datasets, guidelines, rater training, and iterative deployment) can yield safer, more reliable, and more personalized health AI experiences. It argues for a standardized, adaptable methodology that enables responsible innovation in consumer health AI while prioritizing user safety and trust.

Abstract

The incorporation of generative artificial intelligence into personal health applications presents a transformative opportunity for personalized, data-driven health and fitness guidance, yet also poses challenges related to user safety, model accuracy, and personal privacy. To address these challenges, a novel, principle-based framework was developed and validated for the systematic evaluation of large language models (LLMs) applied to personal health and wellness. First, the development of the Fitbit Insights explorer, an LLM-powered system designed to help users interpret their personal health data, is described. Subsequently, the safety, helpfulness, accuracy, relevance, and personalization (SHARP) principle-based framework is introduced as an end-to-end operational methodology that integrates comprehensive evaluation techniques, including human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing, into an iterative development lifecycle. Through the application of this framework to the Fitbit Insights explorer in a staged deployment involving over 13,000 consented users, challenges not apparent during initial testing were systematically identified. This process guided targeted improvements to the system and demonstrated the necessity of combining isolated technical evaluations with real-world user feedback. Finally, a comprehensive, actionable approach is established for the responsible development and deployment of LLM-powered health applications, providing a standardized methodology to foster innovation while ensuring emerging technologies are safe, effective, and trustworthy for users.

Paper Structure

This paper contains 12 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: How the Fitbit Insights explorer system responds to a user query. Incoming queries are routed to the query understanding module, which determines the data and analyses required. The date understanding module determines the start and end dates of the data needed. The associated data is routed to the code generation and execution module, which generates additional analyses if needed. The results from both the analysis tools and code generation are passed to the explanation module together with the system prompt, and the response to the query is returned.
  • Figure 2: Ask Coach user flow application screenshots. User queries received responses with charts, plots, individualized recommendations and suggested follow-up queries.
  • Figure 3: Generative AI models and experiences are evaluated across 3 major steps, which include: 1) preparation - focused on designing the evaluation, defining key performance indicators (KPIs), preparing relevant datasets, developing guidelines for the evaluation, and assigning and training raters; 2) evaluation - focused on implementing an evaluation toolkit which may consist of applying auto-evaluation, human evaluation, as well as safety and red-teaming; and 3) review - focused on rapidly delivering actionable insights and KPI performance for post-market model improvement or post-market monitoring. The entire framework is founded on the SHARP principles of safety, helpfulness, accuracy, relevance, and personalization.
  • Figure 4: Effect of guidelines, type of scales, and rater training on inter-rater reliability. Written guidelines significantly improved inter-rater reliability as assessed using Krippendorff’s alpha (Krippendorff’s alpha median: Guidelines = 0.75; No Guidelines = 0.05; p = 0.0001). Boolean rating scales yielded slightly higher reliability than Likert scales, but the difference was not statistically significant (Krippendorff’s alpha median: Boolean = 0.28; Likert = 0.21; p = 0.151). Document-based training and interactive training each significantly increased reliability over no training (Krippendorff’s alpha median: no training = 0.22 vs. document-only training = 0.32, p = 0.00036; no training = 0.22 vs. interactive training = 0.80, p = 0.0033).
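
Figure 4 reports inter-rater reliability as Krippendorff's alpha. As a rough illustration of what that statistic measures, the sketch below computes alpha for nominal (for example, Boolean) ratings from a coincidence matrix. It is a minimal sketch under stated assumptions, not the paper's analysis code: the function name `krippendorff_alpha_nominal`, the input layout, and the toy ratings are all hypothetical, and the summary does not say which distance metric was used for the Likert scales (an ordinal or interval metric would differ from the nominal one shown here).

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal ratings.

    `units` is a list of lists (hypothetical layout): each inner list holds
    the ratings that one rated item received, with None marking a missing
    rating. Items with fewer than two ratings carry no pairing information
    and are skipped.
    """
    # Coincidence matrix: every ordered pair of ratings given to the same
    # item by different raters, weighted by 1 / (ratings on that item - 1).
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: mass of pairs whose two ratings differ.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    # Disagreement expected by chance, on the same scale as `observed`.
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")

    return 1.0 - observed / expected


# Toy data (invented for illustration): two raters, four items, Boolean scale.
toy_ratings = [[1, 1], [0, 0], [1, 0], [0, 1]]
toy_alpha = krippendorff_alpha_nominal(toy_ratings)
```

An alpha of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is how the medians in Figure 4 (for example, 0.05 without guidelines versus 0.75 with them) can be read.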