MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, Haihua Yang

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case averages 22 rounds (maximum 52 rounds), and together the cases cover 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at https://openreview.net/attachment?id=aKyBCsPOHB&name=supplementary_material

Paper Structure (39 sections, 3 equations, 14 figures, 10 tables)

Figures (14)

  • Figure 1: An example of multi-patient information interference in the domain of long context memory and understanding in the pre-diagnosis stage.
  • Figure 2: Test examples in the remaining four multi-turn difficult dimensions.
  • Figure 3: The distribution of complex multi-turn instruction following problems.
  • Figure 4: Example of test points for synthesis.
  • Figure 5: Workflow of MedMT-Bench. The upper panel depicts the single-stage data synthesis process, which combines multi-agent conversation synthesis, verification, and manual editing to produce challenging examples targeting specific evaluation dimensions. The lower panel shows the multi-stage, scene-by-scene synthesis pipeline. Subsequent scene test cases build on the multi-turn conversations from the preceding scene and derive inputs via a portrait extraction strategy.
  • ...and 9 more figures