ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Monica Munnangi, Saiph Savage

Abstract

Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.

Paper Structure (37 sections, 2 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Comparison of our dataset with existing medical QA datasets. MedQA [jin2020diseasedoespatienthave], PubMedQA [jin2019pubmedqadatasetbiomedicalresearch], and MedMCQA [pal2022medmcqalargescalemultisubject] are derived from medical examinations and consist of multiple-choice questions. MedQA-Followup [manczak2025shallow] and MeDiaQA [suri2021mediaqaquestionansweringdataset] contain adversarial follow-up questions with multiple-choice options. MedRedQA [nguyen2023medredqa], MedicationQA [abacha2019bridging], and HealthSearchQA [53083] contain patient-authored questions, but only in a single-turn format. Our dataset, ThReadMed-QA, is the first to include patient-authored questions and follow-ups together with physician responses.
  • Figure 2: Dataset construction and filtering process. We selected posts from r/AskDocs from Jan 2015 to Jun 2023, totaling approximately 1.8 million posts. After preprocessing and keeping only threads between the original poster (OP) and a physician, we obtain the final set of 2,437 conversation threads.
  • Figure 3: Mean judge score by conversation turn (turns 0-5, n = 238, 238, 238, 109, 61, 40 respectively). Shaded bands show 95% confidence intervals. All models degrade from turn 0 to turn 2; the degradation is statistically significant across all models (p < 0.001, Mann-Whitney U).
  • Figure 4: Cascade failure rates: percentage of wrong responses at turn t + 1 conditioned on the score at turn t. All models show severe cascade effects after a wrong turn, with failure rates up to 6.1× higher than after a correct turn.
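
The conditional failure rates described in the Figure 4 caption can be illustrated with a short Python sketch. This excerpt does not give the paper's exact EPR definition, so the code only shows one plausible reading: the wrong-answer rate at turn t + 1 after a wrong turn t, divided by the same rate after a correct turn t. The label names ("correct", "partial", "wrong"), the function name `cascade_ratio`, and the toy threads are all hypothetical and are not taken from the benchmark.

```python
from collections import Counter

def cascade_ratio(conversations, wrong="wrong", correct="correct"):
    """Return (P(wrong at t+1 | wrong at t), P(wrong at t+1 | correct at t), ratio).

    `conversations` is a list of per-turn label lists, one list per thread.
    Label names are assumptions for illustration, not the paper's rubric.
    """
    counts = Counter()
    for labels in conversations:
        # Walk over consecutive turn pairs (t, t+1) within a single thread.
        for prev, nxt in zip(labels, labels[1:]):
            if prev in (wrong, correct):
                counts[(prev, nxt == wrong)] += 1

    def rate(prev):
        total = counts[(prev, True)] + counts[(prev, False)]
        return counts[(prev, True)] / total if total else None

    after_wrong = rate(wrong)
    after_correct = rate(correct)
    # The ratio is undefined when either rate is missing or the baseline is zero.
    if after_wrong is None or not after_correct:
        return after_wrong, after_correct, None
    return after_wrong, after_correct, after_wrong / after_correct

# Toy data only: four invented threads, not drawn from ThReadMed-QA.
toy_threads = [
    ["correct", "correct", "wrong", "wrong"],
    ["correct", "partial", "correct"],
    ["wrong", "wrong", "correct"],
    ["correct", "correct", "correct", "wrong"],
]
after_wrong, after_correct, ratio = cascade_ratio(toy_threads)
print(after_wrong, after_correct, ratio)
```

On the toy threads, two of the three transitions out of a wrong turn lead to another wrong turn, against two of the six transitions out of a correct turn, so the ratio is 2. Transitions out of a "partial" turn are ignored here; whether the paper treats them the same way is not stated in this excerpt.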