Exploring Large Language Models for Specialist-level Oncology Care

Anil Palepu, Vikram Dhillon, Polly Niravath, Wei-Hung Weng, Preethi Prasad, Khaled Saab, Ryutaro Tanno, Yong Cheng, Hanh Mai, Ethan Burns, Zainub Ajmal, Kavita Kulkarni, Philip Mansfield, Dale Webster, Joelle Barral, Juraj Gottweis, Mike Schaekermann, S. Sara Mahdavi, Vivek Natarajan, Alan Karthikesalingam, Tao Tu

TL;DR

AMIE's performance was overall inferior to that of attending oncologists, suggesting that further research is needed before prospective uses are considered. The work also demonstrates how systems such as AMIE might facilitate conversational interactions to assist clinicians in their decision-making.

Abstract

Large language models (LLMs) have shown remarkable progress in encoding clinical knowledge and responding to complex medical queries with appropriate clinical reasoning. However, their applicability in subspecialist or complex medical settings remains underexplored. In this work, we probe the performance of AMIE, a research conversational diagnostic AI system, in the subspecialist domain of breast oncology care without specific fine-tuning to this challenging domain. To perform this evaluation, we curated a set of 50 synthetic breast cancer vignettes representing a range of treatment-naive and treatment-refractory cases and mirroring the key information available to a multidisciplinary tumor board for decision-making (openly released with this work). We developed a detailed clinical rubric for evaluating management plans, including axes such as the quality of case summarization, safety of the proposed care plan, and recommendations for chemotherapy, radiotherapy, surgery, and hormonal therapy. To improve performance, we enhanced AMIE with the inference-time ability to perform web search retrieval to gather relevant and up-to-date clinical knowledge and refine its responses with a multi-stage self-critique pipeline. We compare the response quality of AMIE with that of internal medicine trainees, oncology fellows, and general oncology attendings under both automated and specialist clinician evaluations. In our evaluations, AMIE outperformed trainees and fellows, demonstrating the potential of the system in this challenging and important domain. We further demonstrate, through qualitative examples, how systems such as AMIE might facilitate conversational interactions to assist clinicians in their decision-making. However, AMIE's performance was overall inferior to that of attending oncologists, suggesting that further research is needed prior to consideration of prospective uses.

Paper Structure

This paper contains 24 sections, 22 figures, 5 tables.

Figures (22)

  • Figure 1: Overview of study design and results. (a) Study design. Breast oncologists evaluate responses from AMIE and six clinicians for the 30 treatment-naive and 20 treatment-refractory cases using the rubric in \ref{tab:evaluation_rubric_mx} and \ref{tab:evaluation_rubric_other}. (b) Proportion of favorable responses for each group. On most evaluation criteria, covering aspects of summarization, safety, and management reasoning, AMIE greatly surpasses the performance of trainees, though it falls short of the oncology attendings. See \ref{fig:results_treatment_naive_mx}, \ref{fig:results_treatment_naive_other}, \ref{fig:results_treatment_refractory_mx}, and \ref{fig:results_treatment_refractory_other} for more detailed breakdowns of each group's performance on the evaluation criteria.
  • Figure 2: Inference strategy for AMIE responses to tumour board cases. AMIE first drafts a response. Then it crafts search queries to gather relevant information, using the results to critique and revise its initial draft and generate a final response.
  • Figure 3: Evaluation rubric for management reasoning. The evaluation rubric for management reasoning criteria. Evaluators were presented with 2-5 answer options per question. "Favorable Options" are answer choices we considered favorable for the analyses that required binary outcomes.
  • Figure 4: Evaluation rubric for summarization, safety, personalization, and diagnostic accuracy. The evaluation rubric for summarization, safety, and other criteria. Evaluators were presented with 2-5 answer options per question. "Favorable Options" are answer choices we considered favorable for the analyses that required binary outcomes.
  • Figure 5: Example of AMIE's assessment and evaluation for a representative treatment-naive case. AMIE's response is shown in the red box on the right, while the evaluation from one of the human evaluators is presented in the blue box in the bottom left.
  • ...and 17 more figures
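
The inference strategy described in the Figure 2 caption (draft, craft search queries, critique and revise with the results, produce a final response) can be sketched roughly as follows. This is only an illustrative sketch, not AMIE's actual implementation: the function name `draft_search_revise`, the `generate` and `search` callables, the prompt wording, and the `max_queries` cap are all hypothetical and do not come from the paper.

```python
from typing import Callable, List


def draft_search_revise(
    case: str,
    generate: Callable[[str], str],      # hypothetical: prompt text -> model text
    search: Callable[[str], List[str]],  # hypothetical: query -> retrieved snippets
    max_queries: int = 3,                # assumed cap, not taken from the paper
) -> str:
    """Draft, search, critique, and revise, following the Figure 2 caption."""
    # 1. Draft an initial response from the case description alone.
    draft = generate(f"Case:\n{case}\n\nDraft a management plan.")

    # 2. Ask the model for search queries, one per line, and keep a few.
    raw_queries = generate(
        f"Case:\n{case}\n\nDraft:\n{draft}\n\nList search queries, one per line."
    )
    queries = [q.strip() for q in raw_queries.splitlines() if q.strip()]
    queries = queries[:max_queries]

    # 3. Retrieve snippets for each query and pool them as evidence.
    evidence = "\n".join(s for q in queries for s in search(q))

    # 4. Critique the draft against the retrieved evidence.
    critique = generate(
        f"Draft:\n{draft}\n\nEvidence:\n{evidence}\n\nCritique the draft."
    )

    # 5. Revise the draft using the critique to produce the final response.
    return generate(
        f"Draft:\n{draft}\n\nCritique:\n{critique}\n\nWrite the final response."
    )
```

Passing `generate` and `search` in as plain callables keeps the sketch independent of any particular model or search backend, so each stage can be swapped or stubbed out separately.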