Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

Zongliang Ji, Ziyang Zhang, Xincheng Tan, Matthew Thompson, Anna Goldenberg, Carl Yang, Rahul G. Krishnan, Fan Zhang

Abstract

Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.

Paper Structure (12 sections, 3 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of our work. (a) We sample 80 primary care cases from 2,000 real-world physician-patient dialogue transcripts. (b) We develop two methods to generate evidence-based medical questions from dialogue using Gemini 2.5. (c) We conduct a pilot study with more than 10 internal clinical experts covering the evaluation metrics, pilot labeling, and the rationale for the experimental design. (d) We perform automated evaluation with an LLM and human evaluation with 6 experienced clinicians on the 80 cases.
  • Figure 2: Stacked bar plots of (a) the human evaluation by six clinicians and (b) the automated evaluation by Gemini on 80 dialogue transcripts. M1 refers to the multi-stage reasoning method and M2 refers to the zero-shot baseline. Each method is rated on a 7-point Likert scale from 1 (Strongly Disagree) to 7 (Strongly Agree).
  • Figure 3: Average scores of the proposed framework, as rated by primary care physicians (PCPs), at 30%, 70%, and 100% dialogue context.
  • Figure 4: Proportional trend of question types preferred by PCPs at 30%, 70%, and 100% dialogue context. Only questions generated by the proposed framework are included. The proportions may be biased by an uneven initial distribution of question types.
  • Figure 5: Clinicians exhibit different rating styles. The X-axis lists the participating clinicians, P1 to P6. The Y-axis shows the range of Likert scores. Each group of box plots shows one clinician’s ratings of the multi-stage reasoning method across 40 annotated cases, evaluated under 30%, 70%, and 100% dialogue context.
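
Figures 3 to 5 report ratings at 30%, 70%, and 100% dialogue context. The captions do not say how a transcript is cut to a given percentage, so the following is only a minimal sketch under one assumption: that context is measured in whole speaker turns, counted from the start of the encounter. The function name `truncate_dialogue` and the sample turns are hypothetical and are not taken from the paper.

```python
from typing import List


def truncate_dialogue(turns: List[str], percent: int) -> List[str]:
    """Return the leading `percent` of a dialogue, measured in whole turns.

    Assumption (not from the paper): context is cut at turn boundaries and
    rounded up, so a non-empty dialogue always keeps at least one turn.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    keep = (len(turns) * percent + 99) // 100
    return turns[:keep]


# Hypothetical 10-turn transcript, used only to illustrate the slicing.
sample_turns = [f"turn {i}" for i in range(1, 11)]

# Build the three context levels named in the figure captions.
contexts = {p: truncate_dialogue(sample_turns, p) for p in (30, 70, 100)}
```

Each entry of `contexts` could then be passed to a question-generation prompt, so that questions produced early in the encounter can be compared with those produced once the full dialogue is available.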