Advice for Diabetes Self-Management by ChatGPT Models: Challenges and Recommendations

Waqar Hussain, John Grundy

TL;DR

This study assesses how ChatGPT-3.5 and GPT-4 perform on diabetes self-management queries, revealing improvements with GPT-4 but continuing gaps in personalized, context-aware guidance, unit handling, and emergency advice. It demonstrates specific failure modes such as misinterpreting blood glucose units, misdiagnosing pseudo-hypoglycemia, and insufficient insulin-storage guidance. To mitigate risks, the authors propose a commonsense evaluation layer and Retrieval Augmented Generation (RAG) to anchor diabetes guidance to external, current medical sources. The findings underscore the need for human oversight in clinical DSMES applications while outlining a concrete path toward safer, more reliable AI-assisted diabetes care. Collectively, the work informs future AI integration in DSMES and highlights practical avenues to enhance accuracy, equity, and clinical safety.

Abstract

Given their capacity for advanced reasoning, extensive contextual understanding, and robust question answering, large language models have become prominent in healthcare management research. Although they handle a broad spectrum of healthcare inquiries adeptly, these models face significant challenges in delivering accurate and practical advice for chronic conditions such as diabetes. We evaluate the responses of ChatGPT versions 3.5 and 4 to diabetes patient queries, assessing their depth of medical knowledge and their capacity to deliver personalized, context-specific advice for diabetes self-management. Our findings reveal discrepancies in accuracy and embedded biases, emphasizing the models' limitations in providing tailored advice unless activated by sophisticated prompting techniques. Additionally, we observe that both models often respond without seeking necessary clarification, a practice that can result in potentially dangerous advice. This underscores the limited practical effectiveness of these models without human oversight in clinical settings. To address these issues, we propose a commonsense evaluation layer for prompt evaluation and the incorporation of disease-specific external memory using an advanced Retrieval Augmented Generation technique. This approach aims to improve information quality and reduce misinformation risks, contributing to more reliable AI applications in healthcare settings. Our findings seek to influence the future direction of AI in healthcare, enhancing both the scope and quality of its integration.
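
The idea of a commonsense evaluation layer that checks a prompt before the model answers can be illustrated with a minimal sketch. The paper does not specify an implementation, so everything here is an assumption made for illustration: the function names `check_glucose_prompt` and `to_mmol_per_l`, the keyword test, and the regular expression are all hypothetical. The sketch only targets one failure mode named in the TL;DR, a blood glucose reading given without its unit, and uses the standard approximate conversion of 18.016 mg/dL per mmol/L for glucose.

```python
import re
from typing import Optional

# Approximate conversion factor for glucose (molar mass ~180.16 g/mol).
MGDL_PER_MMOLL = 18.016

# Hypothetical pattern: a number, optionally followed by a glucose unit.
_READING = re.compile(
    r"(\d+(?:\.\d+)?)\s*(mg\s*/\s*dl|mmol\s*/\s*l)?",
    re.IGNORECASE,
)


def check_glucose_prompt(prompt: str) -> Optional[str]:
    """Return a clarifying question if a glucose reading lacks a unit.

    Returns None when the prompt is not about glucose or when every
    number in it carries an explicit unit. This is an illustrative
    pre-check, not the authors' evaluation layer.
    """
    lowered = prompt.lower()
    if "glucose" not in lowered and "sugar" not in lowered:
        return None
    for match in _READING.finditer(prompt):
        value, unit = match.group(1), match.group(2)
        if unit is None:
            # Ask instead of guessing: 5.5 mmol/L and 5.5 mg/dL mean
            # very different things clinically.
            return f"Is the reading {value} in mg/dL or mmol/L?"
    return None


def to_mmol_per_l(value_mg_dl: float) -> float:
    """Convert a glucose value from mg/dL to mmol/L, rounded to 0.1."""
    return round(value_mg_dl / MGDL_PER_MMOLL, 1)


if __name__ == "__main__":
    print(check_glucose_prompt("My blood sugar is 5.5, what should I do?"))
    print(check_glucose_prompt("My blood glucose is 100 mg/dL after lunch."))
    print(to_mmol_per_l(180))
```

The sketch is deliberately naive: it flags any bare number in a glucose-related prompt, so unrelated figures (for example the "2" in "type 2 diabetes") would also trigger a question. A practical layer would need far more context handling and clinical review.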

Paper Structure

This paper contains 31 sections, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Good and bad advice from ChatGPT 4
  • Figure 2: Enhanced adaptability and integration of ChatGPT and similar language models in healthcare, with improved accuracy and reduced hallucination, based on an advanced Retrieval Augmented Generation model architecture