Learnings from a Large-Scale Deployment of an LLM-Powered Expert-in-the-Loop Healthcare Chatbot

Bhuvan Sachdeva, Pragnya Ramjee, Geeta Fulari, Kaushik Murali, Mohit Jain

TL;DR

The paper reports a large-scale, in-the-wild deployment of CataractBot, an LLM-powered expert-in-the-loop chatbot for cataract surgery queries, built on the BYOeB platform. Over 24 weeks, 318 participants sent 1,992 messages, and seven experts verified 91.71% of the bot's responses, rating 84.52% of medical answers as accurate. As the knowledge base expanded with expert corrections, the share of answers marked accurate and complete rose by 19.02 percentage points (from 65.60% in the first four weeks to 84.62% in the final four), and expert workload declined ("I don't know" answers fell by 7.84%). End-users primarily sought medical information, with usage peaking the day before surgery, and text was the preferred modality. Expert corrections showed recurring patterns: adding new information and addressing patient-specific needs. The study offers design insights for proactive information delivery, personalized knowledge bases, multilingual support, longer conversational history, and robust knowledge management to mitigate inconsistencies and improve reliability in real-world clinical settings.

Abstract

Large Language Models (LLMs) are widely used in healthcare, but limitations like hallucinations, incomplete information, and bias hinder their reliability. To address these, researchers released the Build Your Own expert Bot (BYOeB) platform, enabling developers to create LLM-powered chatbots with integrated expert verification. CataractBot, its first implementation, provides expert-verified responses to cataract surgery questions. A pilot evaluation showed its potential; however, the study had a small sample size and was primarily qualitative. In this work, we conducted a large-scale 24-week deployment of CataractBot involving 318 patients and attendants who sent 1,992 messages, with 91.71% of responses verified by seven experts. Analysis of interaction logs revealed that medical questions significantly outnumbered logistical ones, hallucinations were negligible, and experts rated 84.52% of medical answers as accurate. As the knowledge base expanded with expert corrections, system performance improved by 19.02%, reducing expert workload. These insights guide the design of future LLM-powered chatbots.

Paper Structure (13 sections, 2 figures, 1 table)


Figures (2)

  • Figure 1: (A) Bot's performance for medical and logistical questions over the 24-week deployment, including its accuracy and completeness, based on the proportion of 'Yes' responses from experts, and the proportion of bot's "I don't know" answers. In the first four weeks, 65.60% of the bot's answers were marked as 'accurate and complete', which increased to 84.62% in the final four weeks. (B) Distribution of medical questions asked by patients and attendants relative to the day of surgery (Day 0). (C) Distribution of logistical questions asked by patients and attendants relative to the day of surgery (Day 0).
  • Figure 2: Life cycle of information on CataractBot. (A) Patient's message. (B) Bot's initial answer. (C) Expert's correction. (D) Bot's updated answer with changes highlighted. (E) Knowledge base update: the expert's edited version of the answer, with changes highlighted, added to the knowledge base.