DM-Bench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management

Maria Ana Cardei, Josephine Lamp, Mark Derdzinski, Karan Bhatia

TL;DR

DM-Bench presents the first large-scale, patient-facing benchmark for diabetes management with LLMs. It builds a multimodal, longitudinal evaluation across 7 real-world tasks, using data from 15,000 individuals spanning health and wellness (HW), type 1 diabetes (T1D), and type 2 diabetes (T2D) to generate 360,600 personalized questions scored on five criteria (accuracy, groundedness, safety, clarity, actionability). The framework combines task curation, extensive data curation (including CGM time-series and behavioral logs), and a rigorous LLM-evaluation pipeline; benchmarking 8 LLMs with it shows that no model consistently dominates across all tasks. The results highlight strengths and trade-offs of current models (e.g., GPT-5’s overall performance versus its latency and task-specific weaknesses) and emphasize the need for continued advancement in diabetes-specific reasoning, data-grounded outputs, and actionable guidance. By releasing DM-Bench and its extensible framework, the authors aim to accelerate the development of reliable, safe, and useful AI tools for personalized diabetes self-management and related health domains.

Abstract

We present DM-Bench, the first benchmark designed to evaluate large language model (LLM) performance across real-world decision-making tasks faced by individuals managing diabetes in their daily lives. Unlike prior health benchmarks that are either generic, clinician-facing, or focused on clinical tasks (e.g., diagnosis, triage), DM-Bench introduces a comprehensive evaluation framework tailored to the unique challenges of prototyping patient-facing AI solutions in diabetes, glucose management, metabolic health, and related domains. Our benchmark encompasses 7 distinct task categories, reflecting the breadth of real-world questions individuals with diabetes ask, including basic glucose interpretation, educational queries, behavioral associations, advanced decision making, and long-term planning. To this end, we compile a rich dataset comprising one month of time-series data, encompassing glucose traces and metrics from continuous glucose monitors (CGMs) and behavioral logs (e.g., eating and activity patterns), from 15,000 individuals across three different diabetes populations (type 1, type 2, pre-diabetes/general health and wellness). Using this data, we generate a total of 360,600 personalized, contextual questions across the 7 tasks. We evaluate model performance on these tasks across 5 metrics: accuracy, groundedness, safety, clarity, and actionability. Our analysis of 8 recent LLMs reveals substantial variability across tasks and metrics; no single model consistently outperforms others across all dimensions. By establishing this benchmark, we aim to advance the reliability, safety, effectiveness, and practical utility of AI solutions in diabetes care.
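
To make the five-metric evaluation concrete, the following is a minimal sketch of how per-model pass rates could be aggregated from individual judgments. It is not the authors' pipeline: the record layout, the `pass_rates` helper, the model names, and the assumption that each metric is judged as a boolean pass/fail are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# The five metrics named in the abstract.
METRICS = ("accuracy", "groundedness", "safety", "clarity", "actionability")


def pass_rates(judgments):
    """Return {model: {metric: percentage of answers that passed}}.

    `judgments` is an iterable of dicts, each holding a "model" name and one
    boolean per metric (a hypothetical layout, assumed for this sketch).
    """
    passed = defaultdict(lambda: {m: 0 for m in METRICS})
    totals = defaultdict(int)
    for record in judgments:
        model = record["model"]
        totals[model] += 1
        for metric in METRICS:
            passed[model][metric] += bool(record[metric])
    return {
        model: {m: 100.0 * passed[model][m] / totals[model] for m in METRICS}
        for model in totals
    }


# Hypothetical judgments for two answers from a single placeholder model.
example = [
    {"model": "model_a", "accuracy": True, "groundedness": True,
     "safety": True, "clarity": True, "actionability": False},
    {"model": "model_a", "accuracy": False, "groundedness": True,
     "safety": True, "clarity": True, "actionability": False},
]
rates = pass_rates(example)
```

In this toy input, `model_a` passes accuracy on one of its two answers, so `rates["model_a"]["accuracy"]` is 50.0. Keeping the metrics separate rather than collapsing them into one score mirrors the abstract's observation that no single model leads on every dimension.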

Paper Structure

This paper contains 43 sections, 15 figures, and 16 tables.

Figures (15)

  • Figure 1: DM-Bench spans 7 real-world tasks capturing realistic user needs in diabetes management.
  • Figure 2: DM-Bench overview.
  • Figure 3: Model performance for each metric averaged across all tasks.
  • Figure 4: Percentage of metrics passed for all answers generated by models, where metrics are accuracy, groundedness, safety, clarity, and actionability.
  • Figure 5: Percentage of passing scores across tasks for each metric.
  • ...and 10 more figures