Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

Abstract

Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.

Paper Structure (2 sections, 1 equation, 5 figures)

Table of Contents

  1. Results
  2. Acknowledgments

Figures (5)

  • Figure 1: Measuring conviction and flexibility in LLM clinical decision-making through multi-turn conversational exchange. (a) Positive conviction, where a model must defend a correct initial diagnosis against subsequent incorrect suggestions. (b) Negative conviction, where the model must maintain an initial safe abstention against subsequent incorrect suggestions. (c) Flexibility, where the model must recognize the introduction of the clinical truth after abstaining against an incorrect option. Note that positive and negative conviction may extend up to four turns. The true answer for the example question is highlighted in green.
  • Figure 2: The effect of narrowing the original decision-space to a binary one. (a) Accuracy improvement transitioning from the original answer-space to a simpler binary one for three datasets across increasing model size. (b) Abstention rate improvement of the same datasets and models.
  • Figure 3: The effect of multi-turn conversation on end-to-end accuracy. (a) Positive conviction, or the cumulative survival rate ($C_t$) of an initially correct diagnosis, over $t$ successive turns compared to the single-shot baseline for JAMA CC. Each line represents the conviction of a single model colored by parameter count. (b) End-to-end accuracy comparison between single-shot (SS) and multi-turn (MT) presentation for all models and datasets.
  • Figure 4: The effect of multi-turn conversation on end-to-end abstention rates. (a) Negative conviction, or the cumulative survival rate ($C_t$) of an initially correct abstention, over $t$ successive turns compared to the single-shot baseline for JAMA CC. Each line represents the conviction of a single model colored by parameter count. (b) End-to-end abstention comparison between single-shot (SS) and multi-turn (MT) presentation for all models and datasets.
  • Figure 5: Evaluation of model flexibility and susceptibility to blind switching. (a) Correct switch rates (adopting the correct diagnosis after initially abstaining) versus incorrect switch rates (adopting an incorrect suggestion after initially abstaining) for the JAMA CC dataset, with marker sizes scaled proportionally to model parameter counts. Ideal flexibility, where models switch only when offered a correct suggestion, approaches the bottom right quadrant. Unknowledgeable behavior, switching only to incorrect suggestions, approaches the top left quadrant, and blind switching $(y=x)$ sits between the two. (b) Comparison between rates of switching to correct (+) and incorrect (-) suggestions across datasets and models.
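
The quantities named in the captions for Figures 3 to 5 can be illustrated with a small sketch. This is not the authors' implementation, and the paper's exact definition of $C_t$ is not reproduced on this page. The sketch assumes that $C_t$ is the fraction of initially correct conversations that still hold their initial answer after every turn up to $t$, and that the switch rates are simple proportions over conversations that began with an abstention. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def cumulative_survival(held: Sequence[Sequence[bool]]) -> List[float]:
    """Assumed form of C_t: share of conversations that kept their initial
    correct answer (or abstention) through every turn up to and including t.

    held[i][t] is True if conversation i gave the initial answer at turn t.
    Once a conversation abandons the answer it counts as lost for all later
    turns, even if the model returns to the answer afterwards.
    """
    if not held:
        return []
    n_turns = len(held[0])
    alive = [True] * len(held)
    rates = []
    for t in range(n_turns):
        # A conversation survives turn t only if it survived all earlier turns.
        alive = [a and conv[t] for a, conv in zip(alive, held)]
        rates.append(sum(alive) / len(held))
    return rates


def switch_rates(
    to_correct: Sequence[bool], to_incorrect: Sequence[bool]
) -> Tuple[float, float]:
    """Proportion of abstaining conversations that adopted a correct
    suggestion, and the proportion that adopted an incorrect suggestion."""
    correct_rate = sum(to_correct) / len(to_correct)
    incorrect_rate = sum(to_incorrect) / len(to_incorrect)
    return correct_rate, incorrect_rate


# Toy data (hypothetical): four conversations, three challenge turns each.
held = [
    [True, True, True],
    [True, False, True],   # abandoned at turn 2, so lost from then on
    [False, False, False],
    [True, True, False],
]
print(cumulative_survival(held))  # [0.75, 0.5, 0.25]

correct_rate, incorrect_rate = switch_rates(
    [True, True, False, True], [True, False, False, False]
)
# A gap near zero corresponds to the y = x "blind switching" line in Figure 5.
print(correct_rate - incorrect_rate)  # 0.5
```

Under these assumptions, a model with ideal flexibility has a high correct switch rate and an incorrect switch rate near zero, while a model whose two rates are roughly equal is switching regardless of whether the suggestion is right.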