Between Knowledge and Care: Evaluating Generative AI-Based IUI in Type 2 Diabetes Management Through Patient and Physician Perspectives

Yibo Meng, Ruiqi Chen, Bingyi Liu, Yan Guan, Xiaolan Ding

TL;DR

This study examines how generative AI-based intelligent user interfaces (IUIs) support Type 2 diabetes management by integrating patient experiences and physician evaluations in China. It develops a real-world benchmark of 66 patient questions across seven domains, with an accompanying five-dimensional rubric (Accuracy, Safety, Clarity, Integrity, Action Orientation) that experts use to assess four AI models. Quantitative results reveal a clear model hierarchy (ChatGPT strongest; the others lag and vary more) and domain-specific gaps, especially in medication guidance, medication interpretation, and emotional support. Qualitative insights underscore the need for trust calibration, risk-aware fallbacks, and human–AI collaboration. The paper ultimately argues for task-aware orchestration and emotionally attuned interfaces to integrate AI safely into chronic-care workflows.

Abstract

Generative AI systems are increasingly adopted by patients seeking everyday health guidance, yet their reliability and clinical appropriateness remain uncertain. Taking Type 2 Diabetes Mellitus (T2DM) as a representative chronic condition, this paper presents a two-part mixed-methods study that examines how patients and physicians in China evaluate the quality and usability of AI-generated health information. Study 1 analyzes 784 authentic patient questions to identify seven core categories of informational needs and five evaluation dimensions: Accuracy, Safety, Clarity, Integrity, and Action Orientation. Study 2 involves seven endocrinologists who assess responses from four mainstream AI models across these dimensions. Quantitative and qualitative findings reveal consistent strengths in factual and lifestyle guidance but significant weaknesses in medication interpretation, contextual reasoning, and empathy. Patients view AI as an accessible "pre-visit educator," whereas clinicians highlight its lack of clinical safety and personalization. Together, the findings inform design implications for interactive health systems, advocating for multi-model orchestration, risk-aware fallback mechanisms, and emotionally attuned communication to ensure trustworthy AI assistance in chronic disease care.

Paper Structure

This paper contains 39 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Heatmap of System Usability Scale (SUS)-style ratings from Study 1 participants. Each cell shows one participant’s score (1–10) on one of the 10 items, with color encoding magnitude. The right column displays means (SDs).
  • Figure 2: Overall quality evaluation of four AI models. Each box represents the distribution of aggregated scores across all evaluation dimensions and question types for one AI system. The figure provides an overall comparison of model quality.
  • Figure 3: Dimension-wise quality evaluation of four AI models across five key criteria: Accuracy, Safety, Clarity, Integrity, and Action Orientation. Each subplot represents one AI model. Boxplots depict the distribution of scores across all evaluated questions under each criterion.
  • Figure 4: Dimension-wise comparison of AI models across five key quality dimensions: Accuracy, Safety, Clarity, Integrity, and Action Orientation. Each polygon represents one AI model’s average score on the five dimensions. Larger enclosed areas indicate stronger overall performance across dimensions.
  • Figure 5: Overall physician evaluation of AI-generated information quality across seven question categories: Factivity, Diet Management, Sports Advice, Medication Guide, Medication Interpretation, Complications, and Life & Psychology. Each box represents aggregated ratings across all four AI systems, illustrating cross-category variability and the relative difficulty of each question type.
  • ...and 3 more figures