Table of Contents
Fetching ...

Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations

Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen, Xia Hu, Tianlong Chen

TL;DR

A novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards is introduced, and dialogue-based fine-tuning is explored, which transforms static datasets into conversational formats to better capture iterative reasoning processes.

Abstract

Current medical AI systems often fail to replicate real-world clinical reasoning, as they are predominantly trained and evaluated on static text and question-answer tasks. These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information. To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. Moreover, we explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of $9.64\%$ in multi-round reasoning scenarios and $6.18\%$ in accuracy in a noisy environment. Our findings highlight dialogue tuning as a promising approach for advancing clinically aligned and robust medical AI systems.

Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations

TL;DR

A novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards is introduced, and dialogue-based fine-tuning is explored, which transforms static datasets into conversational formats to better capture iterative reasoning processes.

Abstract

Current medical AI systems often fail to replicate real-world clinical reasoning, as they are predominantly trained and evaluated on static text and question-answer tasks. These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information. To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. Moreover, we explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of in multi-round reasoning scenarios and in accuracy in a noisy environment. Our findings highlight dialogue tuning as a promising approach for advancing clinically aligned and robust medical AI systems.

Paper Structure

This paper contains 24 sections, 4 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Previous medical LLMs are trained on next token prediction with medical text (Article-based tuning) or medical Question-Answer pair (Multi-choice tuning). We find tuning on natural dialogue data (Dialogue tuning) better for learning to reason. To achieve this, we convert raw article and multi-choice QA samples into dialogue samples with Llama-3.1-8B.
  • Figure 2: The MuddyMaze benchmarks encompass two settings: one-round evidence ranking and multi-round evidence ranking. In the one-round evidence ranking, the model is required to identify the correct evidence and output it in order. In the multi-round evidence ranking, the model must update the current information with each selection, iterating via several rounds to reach the endpoint.
  • Figure 3: Format document QA sample to our One-Round evidence ranking sample
  • Figure 4: The left pie chart represents the ratio of difficulty levels in our benchmark. While the right pie chart represents the proportion of multiple-choice question-answering sets and articles used during the tuning stage, the dialogues generated from these sources are equal in quantity to them.
  • Figure 5: Performance comparison of Llama and Qwen models across different dialogue settings ("Dialogue (MC)", "Dialogue (Article)", and "Combined Dialogue") in "One Round" and "Multi Round" scenarios, highlighting that the Combined Dialogue accuracy remains largely consistent with the separate dialogue settings.
  • ...and 2 more figures