A Scalable Framework for Evaluating Health Language Models

Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, Javier L. Prieto, Daniel McDuff, Ahmed A. Metwally

TL;DR

This work introduces Adaptive Precise Boolean rubrics for evaluating health-focused LLMs, converting open-ended and Likert-style criteria into granular Yes/No checks to improve inter-rater reliability and enable automation. It develops Auto-Adaptive and Human-Adaptive variants, in which an LLM or human raters select the rubric questions relevant to each query, reducing evaluation burden while preserving signal quality, and demonstrates comparable or superior reliability to traditional methods. The framework is validated in metabolic health with real multi-modal data (biomarkers, wearables, and user context), showing improved detection of context-dependent response quality and greater efficiency, including an application to the real-world WEAR-ME study in which automated evaluation with Precise Boolean rubrics is compared against Likert rubrics. The findings support scalable, cost-effective, and robust evaluation of health LLMs, with implications for deployment controls, safety, and personalization in clinically relevant settings.

Abstract

Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety. Current evaluation practices for open-ended text responses rely heavily on human experts. This approach introduces human factors, is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare, where response assessment requires domain expertise and must consider multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
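The core idea, a large pool of granular Yes/No questions narrowed to the subset relevant to a given query, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `RubricQuestion` type, the example question texts, and in particular the topic-overlap rule in `adapt_rubric`, which stands in for the human or LLM-based relevance judgment used in the paper and is not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricQuestion:
    """One granular Yes/No rubric item (hypothetical structure)."""
    qid: str
    text: str
    topics: frozenset  # topics for which this question is relevant


def adapt_rubric(rubric, query_topics):
    """Keep only the questions relevant to this query.

    Topic overlap is a stand-in for the relevance decision; the paper
    uses human raters or an LLM classifier for this step.
    """
    return [q for q in rubric if q.topics & query_topics]


def boolean_score(answers, questions):
    """Fraction of adapted questions answered 'Yes' (True)."""
    if not questions:
        raise ValueError("no relevant rubric questions for this query")
    return sum(1 for q in questions if answers[q.qid]) / len(questions)


# Illustrative rubric pool; question texts are invented for this sketch.
RUBRIC = [
    RubricQuestion("q1", "Does the response cite the user's HbA1c value?",
                   frozenset({"glucose"})),
    RubricQuestion("q2", "Does the response avoid unsafe dosing advice?",
                   frozenset({"glucose", "lipids"})),
    RubricQuestion("q3", "Does the response reference the user's LDL value?",
                   frozenset({"lipids"})),
    RubricQuestion("q4", "Does the response use the user's sleep data?",
                   frozenset({"sleep"})),
]

# A glucose-related query only needs two of the four questions answered.
adapted = adapt_rubric(RUBRIC, frozenset({"glucose"}))
answers = {"q1": False, "q2": True}  # rater's Yes/No judgments
print([q.qid for q in adapted], boolean_score(answers, adapted))
```

In this sketch the rater answers only the adapted questions, which is where the reduction in evaluation effort would come from; the aggregate score is a simple proportion and is only one of several reasonable ways to summarize boolean answers.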

Paper Structure

This paper contains 48 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: A precision rubric for evaluating open-ended language responses to health queries. (A) A set of representative health queries and wearable data are used to construct inputs to the language model, whose responses are then evaluated using our proposed evaluation rubric. (B) Our results indicate that the proposed Precise Boolean rubric leads to higher agreement with humans (both expert and non-expert). Moreover, with our proposed Adaptive Precise Boolean rubric, evaluation takes about half the time required by the traditional Likert scale, while maintaining high rater agreement and high-quality evaluation. (C) An example of a query and response highlighting references to specific relevant parts of the response. (D) Examples of evaluation rubrics for assessing the generated response to the input query.
  • Figure 2: Precise Boolean and Adaptive Precise Boolean rubrics increase the consistency between human evaluators (expert and non-expert), and between human and automated evaluation. (A) Inter-rater correlation, as measured by the intraclass correlation coefficient (ICC; see Methods, evaluation metrics section), between different subgroups (expert and non-expert human evaluators, and automated evaluation). (B) Adaptive Precise Boolean rubrics take about half the evaluation time of Likert scale questions.
  • Figure 3: Implications on average ratings. Ratings obtained from auto-evals using the boolean rubrics are more consistent/correlated with human ratings. In addition, our results show that replacing all questions with an adaptive set has little impact on the evaluation signal.
  • Figure 4: Comparison of Auto-Adaptive Precise Boolean to Human-Adaptive Precise Boolean rubrics. (A) Adaptation of Precise Boolean rubrics using Gemini 1.5 Pro as a zero-shot rubric question classifier does not degrade rater correlation metrics (intraclass correlation coefficient, ICC) compared to human-driven adaptation. (B) Auto-Adaptive rubrics show a similar average rating trend to Human-Adaptive rubrics, indicating that the Auto-Adaptive evaluation criteria are sufficient to capture the evaluation signals present under human adaptation.
  • Figure 5: Application of the proposed approach to a real health study. (A) Overview of the Wearables for Metabolic Health (WEAR-ME) Study. (B) We filtered participants in the WEAR-ME study based on markers for existing metabolic conditions, particularly obesity (BMI), diabetes (HbA1c), and hypercholesterolemia (LDL). (C) Illustration of our prompt ablation scheme, in which we altered the generation prompts to omit key blood biomarkers for the incoming queries. (D) Measuring the sensitivity of an auto-rater to prompt alterations using Likert rubrics and the proposed Precise Boolean rubrics. Note that the Likert rubric scores are normalized (as in Fig. 3) so that the average discrepancy is on the same scale as the Precise Boolean scores.
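
Figures 2 and 4 report agreement as an intraclass correlation coefficient (ICC). The ICC variant used in the paper is not stated in this excerpt, so the sketch below assumes one common choice, the Shrout and Fleiss ICC(2,1) (two-way random effects, absolute agreement, single rater), purely to show how such a number can be computed from a table of ratings. The function name `icc2_1` and the input layout (one row per rated response, one column per rater) are assumptions of this sketch.

```python
def icc2_1(ratings):
    """ICC(2,1) for a table with one row per subject and one column per rater.

    Assumes a complete table (no missing ratings) with at least two
    subjects and two raters.
    """
    n = len(ratings)      # number of rated responses (subjects)
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Two raters giving identical Yes/No (1/0) answers on four responses.
perfect = [[1, 1], [0, 0], [1, 1], [0, 0]]
print(icc2_1(perfect))

# Illustrative 6-subject, 4-rater table of ordinal scores.
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(table), 2))
```

Because ICC(2,1) penalizes systematic offsets between raters, it is sensitive to one rater scoring consistently higher than another; a consistency-type ICC would not be, so the choice of variant matters when comparing Likert and boolean rubrics.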