Towards Democratization of Subspeciality Medical Expertise

Jack W. O'Sullivan, Anil Palepu, Khaled Saab, Wei-Hung Weng, Yong Cheng, Emily Chu, Yaanik Desai, Aly Elezaby, Daniel Seung Kim, Roy Lan, Wilson Tang, Natalie Tapaskar, Victoria Parikh, Sneha S. Jain, Kavita Kulkarni, Philip Mansfield, Dale Webster, Juraj Gottweis, Joelle Barral, Mike Schaekermann, Ryutaro Tanno, S. Sara Mahdavi, Vivek Natarajan, Alan Karthikesalingam, Euan Ashley, Tao Tu

TL;DR

This study asks whether AMIE, an LLM-based system optimized for diagnostic dialogue, can help extend subspecialist cardiology expertise to general cardiologists assessing patients with suspected genetic cardiovascular disease. On a curated real-world dataset of 204 complex cases from a subspecialist cardiology practice, AMIE, enhanced with web search and self-critique, was compared with general cardiologists; subspecialists, blinded to the source, rated both on a ten-domain rubric. AMIE was rated superior to general cardiologists in 5 of the 10 domains and equivalent in the rest. When cardiologists were allowed to view AMIE's response, their overall response quality improved in 63.7% of cases and declined in 3.4%, which points to a complementary, assistive role. The findings support the potential of specialized AI to augment subspecialty care, while underscoring the need for prospective validation, safety considerations, and careful integration with expert oversight.

Abstract

The scarcity of subspecialist medical expertise, particularly in rare, complex and life-threatening diseases, poses a significant challenge for healthcare delivery. This issue is particularly acute in cardiology, where timely, accurate management determines outcomes. We explored the potential of AMIE (Articulate Medical Intelligence Explorer), a large language model (LLM)-based experimental AI system optimized for diagnostic dialogue, to augment and support clinical decision-making in this challenging context. We curated a real-world dataset of 204 complex cases from a subspecialist cardiology practice, including results for electrocardiograms, echocardiograms, cardiac MRI, genetic tests, and cardiopulmonary stress tests. We developed a ten-domain evaluation rubric used by subspecialists to evaluate the quality of diagnosis and clinical management plans produced by general cardiologists or AMIE, the latter enhanced with web-search and self-critique capabilities. AMIE was rated superior to general cardiologists for 5 of the 10 domains (with preference ranging from 9% to 20%), and equivalent for the rest. Access to AMIE's response improved cardiologists' overall response quality in 63.7% of cases while lowering quality in just 3.4%. Cardiologists' responses with access to AMIE were superior to cardiologists' responses without access to AMIE for all 10 domains. Qualitative examination suggests that AMIE and general cardiologists could complement each other, with AMIE being thorough and sensitive and general cardiologists being concise and specific. Overall, our results suggest that specialized medical LLMs have the potential to augment general cardiologists' capabilities by bridging gaps in subspecialty expertise, though further research and validation are essential for wide clinical utility.

Paper Structure (25 sections, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Study design. Text reports from the cardiac testing data of 204 patients with suspected genetic cardiovascular disease were provided to AMIE as well as general cardiologists. AMIE and the cardiologists each answered the assessment form shown in Figure 3. Later, the cardiologists were allowed to view AMIE's responses and make any changes to their initial assessments. Subspecialist cardiologists from the Stanford Center for Inherited Cardiovascular Disease provided individual ratings as well as direct preferences between AMIE and cardiologists (\ref{fig:direct}) and between the cardiologist responses before and after seeing AMIE's assessment (\ref{fig:assist}). Subspecialists were blinded to the source of the responses and reviewed them in a randomised sequence.
  • Figure 2: a) Development of AMIE. AMIE was trained with a self-play based simulated learning environment (see Tu et al., 2024 for details). We leveraged AMIE without any additional instruction fine-tuning. b) Specialization and evaluation of AMIE. Of the 213 total cases, 9 were used to iterate on the prompting and inference strategy, while the remaining 204 were used to test AMIE and the cardiologists. During the study, after individually completing the assessment form in Figure 3, cardiologists could see AMIE's response and alter their own. Subspecialist cardiologists from the Stanford Center for Inherited Cardiovascular Disease provided individual ratings (\ref{fig:Box2}) and direct preferences (\ref{fig:Box3}) between AMIE and cardiologists, and between the cardiologist responses with and without assistance from AMIE.
  • Figure 3: Assessment Form for AMIE/cardiologist responses to cases. AMIE and cardiologists were provided clinical text from various cardiac tests for each patient and asked to complete the assessment form. They initially completed all but the last question without genetic test results, and were then provided any available genetic test results to answer the last question.
  • Figure 4: Subspecialist Preference Evaluation Form. Subspecialists were provided with two different responses (blinded) and asked to supply their preference (Response 1, Tie, or Response 2) for 9 different aspects of the response as well as the entire response as a whole. The same rubric was used for the direct comparison between AMIE's and the cardiologists' responses as well as between the cardiologists' responses with (assisted) and without (unassisted) access to AMIE's answers.
  • Figure 6: a) Preference between AMIE and cardiologist responses. AMIE responses are preferred over the cardiologist responses for 5 of 10 domains (Consult Question Explanation, Additional Patient Information, Additional Test Information, Management, and Genetics Explanation) and non-inferior for the rest. b) Individual assessment of AMIE and cardiologist responses. Bars indicate the proportion of 'yes' responses for each of the questions in \ref{fig:Box2}. AMIE's responses more often have extra content and clinically significant errors, while the cardiologists' responses more often are inapplicable for particular medical demographics.
  • ...and 9 more figures
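
The blinded preference evaluation described for Figure 4 (Response 1, Tie, or Response 2 per domain) can be illustrated with a small tally. The following is a minimal sketch only, not the authors' analysis code: the function names, the record layout, and the example votes are all hypothetical, and the sketch assumes each vote is stored together with the sources that were shown as Response 1 and Response 2 so that it can be unblinded after rating.

```python
from collections import Counter

# The three choices offered on the preference form (Figure 4).
CHOICES = ("Response 1", "Tie", "Response 2")


def unblind(choice, first_source, second_source):
    """Map a blinded choice back to the source it refers to (hypothetical helper)."""
    if choice not in CHOICES:
        raise ValueError(f"unexpected choice: {choice!r}")
    if choice == "Tie":
        return "Tie"
    return first_source if choice == "Response 1" else second_source


def preference_shares(records):
    """Return the share of votes per source (and ties) for one domain.

    Each record is (choice, first_source, second_source); this layout is an
    assumption made for illustration.
    """
    if not records:
        raise ValueError("no votes to tally")
    counts = Counter(unblind(*record) for record in records)
    total = len(records)
    return {label: count / total for label, count in counts.items()}


# Made-up votes for a single domain; the presentation order varies per case,
# as it would under a randomised, blinded review.
example_records = (
    [("Response 1", "AMIE", "Cardiologist")] * 3
    + [("Response 2", "Cardiologist", "AMIE")] * 2
    + [("Tie", "AMIE", "Cardiologist")] * 3
    + [("Response 1", "Cardiologist", "AMIE")] * 2
)
shares = preference_shares(example_records)
```

Tallying by source rather than by position keeps the result independent of which response happened to be shown first; the same tally would apply to the assisted versus unassisted comparison by changing the source labels.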