Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

Jinrui Fang, Runhan Chen, Xu Yang, Jian Yu, Jiawei Xu, Ashwin Vinod, Wenqi Shi, Tianlong Chen, Heng Ji, ChengXiang Zhai, Ying Ding, Yuji Zhang

Abstract

Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave when evidence accumulates over multiple turns, as in real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures: clinically salient information such as laboratory results triggers premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
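The evaluation protocol the abstract describes (revealing evidence shards one turn at a time, letting the model hold or commit, and tracking first-commitment turns and answer flips) can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the names `Case`, `run_case`, and `count_flips` are hypothetical, and a real model would be an LLM call rather than a plain function.

```python
# Hypothetical sketch of a MINT-style multi-turn evaluation loop.
# All names here are illustrative; the benchmark's real interface is not shown in this page.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Case:
    shards: List[str]   # clinically labeled evidence shards, in reveal order
    diagnosis: str      # gold diagnosis label

# A "model" maps the evidence seen so far to either None ("hold": wait for
# more evidence) or a committed diagnosis string.
Model = Callable[[List[str]], Optional[str]]

def run_case(model: Model, case: Case) -> Tuple[Optional[int], List[str]]:
    """Reveal one shard per turn; record per-turn state and first-commitment turn."""
    states: List[str] = []          # each turn: "hold", "correct", or "incorrect"
    first_answer_turn: Optional[int] = None
    for turn in range(1, len(case.shards) + 1):
        answer = model(case.shards[:turn])
        if answer is None:
            states.append("hold")
        else:
            if first_answer_turn is None:
                first_answer_turn = turn
            states.append("correct" if answer == case.diagnosis else "incorrect")
    return first_answer_turn, states

def count_flips(states: List[str]) -> Tuple[int, int]:
    """Count incorrect->correct and correct->incorrect revisions across answered turns."""
    answered = [s for s in states if s != "hold"]
    i2c = sum(a == "incorrect" and b == "correct"
              for a, b in zip(answered, answered[1:]))
    c2i = sum(a == "correct" and b == "incorrect"
              for a, b in zip(answered, answered[1:]))
    return i2c, c2i
```

Under this framing, "intent to answer" is an early `first_answer_turn`, and the paper's incorrect-to-correct versus correct-to-incorrect asymmetry corresponds to the two counts returned by `count_flips`.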

Paper Structure

This paper contains 23 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of MINT benchmark construction and evaluation settings.
  • Figure 2: Diagnostic impatience across information accumulation and model types. Colors denote hold, correct, and incorrect responses. (a) Distribution of initial-answer turns under different conversation lengths (N=4, 8, 12, 16) for GPT-5-mini. (b) Turn-level distribution of hold, correct, and incorrect responses for GPT-5-mini by number of shards seen. (c) Guess rates across models.
  • Figure 3: Proportion of Correct, Incorrect, and Hold responses across six turns for 11 models. For most models, hold responses decrease over turns while the proportion of correct answers increases by the final turn.
  • Figure 4: Effects of lab-result timing on response behavior and self-correction for GPT-5 mini in 10-turn cases. (a) Distribution of the first-answer turn under early, middle, and late lab-result placement, shown separately for lab-dependent and non-lab-dependent diseases. (b) Turn-level response composition for each lab-result order, showing correct, incorrect, and hold states across turns. (c) Heatmaps of False-to-True (left) and True-to-False (right) transitions by window size/turn gap, stratified by disease type and lab-result placement.
  • Figure 5: (a) Running answer states across turns for GPT-5-mini under different shard counts (N = 4, 8, 12, 16). As more evidence is revealed, the model's current answer state can still improve, with some initially incorrect states later corrected. (b) Cumulative first-answer outcomes across turns. Most first committed answers occur early, and many of these early commitments are incorrect.
  • ...and 3 more figures