Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Christopher Chiu, Silviu Pitis, Mihaela van der Schaar

TL;DR

This work introduces VivaBench, an open-source, multi-turn benchmark that evaluates sequential clinical reasoning in large language models by simulating viva voce-style medical examinations. It encodes clinical cases as structured vignettes with History, Physical, Imaging, and Laboratory data, plus ground-truth diagnoses, and requires agents to iteratively gather information, update hypotheses, and justify diagnoses. Across six state-of-the-art LLMs, performance degrades substantially in the interactive setting, revealing failure modes such as anchoring, inappropriate test ordering, premature closure, and poor screening for critical conditions, along with variable confidence calibration. The framework combines deterministic and LLM-based mappers with a parsing layer to produce reproducible, clinically grounded interactions and metrics, contributing a rigorous benchmark for clinical decision support and insights into agentic AI in high-stakes environments.
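As a rough illustration of the structure described above, the following Python sketch models a vignette with the four named data categories and scores how well a set of requested findings matches the diagnosis-relevant ones. The class name `Vignette`, its fields, the `lookup` method, and the function `gathering_precision_recall` are hypothetical names chosen for this example; they are not VivaBench's actual schema or metric implementation.

```python
from dataclasses import dataclass, field

# Hypothetical category names, mirroring the four data types named in the summary.
CATEGORIES = ("history", "physical", "imaging", "laboratory")


@dataclass
class Vignette:
    """Assumed structure for one case; not the benchmark's real schema."""
    history: dict[str, str]
    physical: dict[str, str]
    imaging: dict[str, str]
    laboratory: dict[str, str]
    diagnoses: list[str]
    # Keys of findings considered relevant to the ground-truth diagnosis (assumption).
    relevant_keys: set[str] = field(default_factory=set)

    def lookup(self, category: str, key: str) -> str:
        """Reveal one finding on request, as a simulated examiner might."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        return getattr(self, category).get(key, "Not available")


def gathering_precision_recall(requested: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of the requested findings against the relevant ones."""
    hits = len(requested & relevant)
    precision = hits / len(requested) if requested else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative case with made-up findings.
case = Vignette(
    history={"chest_pain": "Central, radiating to the left arm"},
    physical={"heart_sounds": "Normal"},
    imaging={"chest_xray": "Clear lung fields"},
    laboratory={"troponin": "Elevated"},
    diagnoses=["myocardial infarction"],
    relevant_keys={"chest_pain", "troponin"},
)
requested = {"chest_pain", "troponin", "chest_xray", "heart_sounds"}
precision, recall = gathering_precision_recall(requested, case.relevant_keys)
```

In this toy case the agent requests every relevant finding but also two that are not relevant, so recall is perfect while precision is halved. This is the kind of trade-off an information-gathering metric is meant to expose.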

Abstract

Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly in our evaluation when they are required to navigate iterative diagnostic reasoning under uncertainty. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failure to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.

Paper Structure

This paper contains 24 sections, 3 equations, 7 figures, 9 tables, 4 algorithms.

Figures (7)

  • Figure 1: Action and reasoning trace of two evaluated models on our simulated viva voce examination. Given the initial scenario (green), agents (blue) are tasked with diagnosing the patient (orange), who is simulated by our evaluation framework. c indicates confidence in the diagnosis. Failure to perform a targeted clinical review (left) can have serious consequences, such as a missed diagnosis of heart attack (correctly diagnosed on the right).
  • Figure 2: Radar plot comparing precision and recall metrics for how effectively models gather clinical information. Targeted metrics assess performance on gathering diagnosis-relevant information only, while Overall metrics include all available clinical information. Review includes history-taking and physical examination, while Investigations covers labs and imaging. Higher values are better.
  • Figure 3: Accuracy of top-1 to top-5 diagnoses.
  • Figure 4: Performance across specialties.
  • Figure 5: Kernel density estimation (KDE) contours representing the distribution of model performance across clinical cases. The quadrants reflect differences in confidence and accuracy of diagnosis: confidently accurate (top-right), overconfident and wrong (top-left), underconfident but accurate (bottom-right), and appropriately uncertain (bottom-left).
  • ...and 2 more figures
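
The quantities named in the captions for Figures 3 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, diagnoses are compared by exact string match, and the 0.5 cut-off separating the quadrants is an arbitrary choice made for this example.

```python
def top_k_accuracy(ranked_predictions: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of cases whose true diagnosis appears among the first k ranked predictions.

    Assumes exact string matching between predicted and true diagnoses.
    """
    if not truths:
        return 0.0
    hits = sum(1 for preds, truth in zip(ranked_predictions, truths) if truth in preds[:k])
    return hits / len(truths)


def quadrant(confidence: float, accuracy: float, threshold: float = 0.5) -> str:
    """Label a case by the four regions described in the Figure 5 caption.

    Confidence is the vertical axis and accuracy the horizontal axis;
    the threshold of 0.5 is an assumption for illustration only.
    """
    confident = confidence >= threshold
    accurate = accuracy >= threshold
    if confident and accurate:
        return "confidently accurate"
    if confident:
        return "overconfident and wrong"
    if accurate:
        return "underconfident but accurate"
    return "appropriately uncertain"


# Two illustrative cases, each with a ranked list of predicted diagnoses.
predictions = [
    ["myocardial infarction", "pulmonary embolism"],
    ["gastroesophageal reflux", "myocardial infarction"],
]
truths = ["myocardial infarction", "myocardial infarction"]
top1 = top_k_accuracy(predictions, truths, k=1)
top2 = top_k_accuracy(predictions, truths, k=2)
```

In this toy example the second case is only correct once the second-ranked diagnosis is counted, so accuracy rises from top-1 to top-2.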