Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain

Burcu Sayin, Pasquale Minervini, Jacopo Staiano, Andrea Passerini

TL;DR

This work analyzes how prompt design and physician interaction shape the ability of open-source LLMs to assist medical decision-making. By introducing a PubMedQA-based binary dataset and evaluating three 7B-scale models (Meditron, Llama2, Mistral) across four interaction scenarios, it investigates whether LLMs can correct physician errors, explain their reasoning, and benefit from expert input. Key findings show that carefully crafted prompts enable LLMs to provide plausible explanations and sometimes correct erroneous physician inputs, but the models rarely outperform physicians and larger 70B variants do not guarantee improvement. The study highlights the critical role of prompt engineering for reliable physician–LLM collaboration and outlines directions for future research in real-world clinical validation and safety considerations.

Abstract

We explore the potential of Large Language Models (LLMs) to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the answer of the model is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy up to 74% depending on the prompt being used, while Llama2 and Meditron models exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.

Paper Structure (23 sections, 1 figure, 8 tables)

Figures (1)

  • Figure 1: Prompt design. The left figure shows the complete prompt template. We start with task instructions; while a summary is provided here as an example, detailed instructions for each use case can be found in the appendix on prompt instructions. Then, we incorporate the few-shot examples, with their order varying depending on scenarios 1-4. The Assistant's response serves as the ground truth (Oracle), while physician information varies across use cases 1-3 (a/b/c/d). In the baseline case, no information from the physician is provided. Subsequently, we present the test input, where the user provides context and poses a question, followed by information from the physician depending on the use case. On the right side of the figure, detailed information is provided for few-shot example scenarios and use cases.
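
The template layout described in the Figure 1 caption (task instructions, then few-shot examples, then the test input with optional physician information) can be sketched roughly as follows. This is only an illustration, not the authors' code: the names `Example`, `render_user_turn`, and `build_prompt`, and the turn labels `User:`, `Assistant:`, `Context:`, `Question:`, and `Physician:`, are assumptions introduced here, and the actual prompt wording used in the paper may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    """One few-shot example (hypothetical structure, not from the paper)."""
    context: str
    question: str
    answer: str                      # ground-truth (Oracle) answer, shown as the Assistant turn
    physician: Optional[str] = None  # physician input; None mimics the baseline case


def render_user_turn(context: str, question: str, physician: Optional[str]) -> str:
    """Format a user turn: context, question, then physician input if any."""
    lines = [f"Context: {context}", f"Question: {question}"]
    if physician is not None:
        lines.append(f"Physician: {physician}")
    return "\n".join(lines)


def build_prompt(
    instructions: str,
    few_shot: Sequence[Example],
    context: str,
    question: str,
    physician: Optional[str] = None,
) -> str:
    """Assemble instructions, few-shot examples, and the test input in order."""
    parts = [instructions]
    for ex in few_shot:
        # Each example is a user turn followed by the Oracle answer.
        parts.append("User:\n" + render_user_turn(ex.context, ex.question, ex.physician))
        parts.append("Assistant: " + ex.answer)
    # The test input ends with an open Assistant turn for the model to complete.
    parts.append("User:\n" + render_user_turn(context, question, physician))
    parts.append("Assistant:")
    return "\n\n".join(parts)
```

Under these assumptions, passing `physician=None` reproduces the baseline case with no physician information, while reordering the `few_shot` sequence is one way to mimic the varying example order across scenarios 1-4.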