Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations

Jean-Philippe Corbeil, Asma Ben Abacha, Jerome Tremblay, Phillip Swazinna, Akila Jeeson Daniel, Miguel Del-Agua, Francois Beaulieu

TL;DR

The paper introduces MEDIQA-OE 2025, the first shared task focused on extracting structured medical orders from doctor-patient conversations to populate EHRs, addressing long, multi-speaker dialogues and mixed-output fields. It benchmarks prompting-based approaches across closed- and open-weight LLMs using the ACI-Bench and PriMock57 datasets, with evaluation on four fields (description, order_type, reason, provenance) and a composite leaderboard metric. Top results come from GPT-4 with constrained decoding, while open-weight models show a strong size-performance correlation, highlighting the value of model scale in few-shot settings. The study identifies remaining gaps in the description and provenance fields, points to dataset size as a limiting factor, and suggests future work on data augmentation, finetuning, and hybrid prompting strategies to further reduce documentation burden and improve EHR accuracy.

Abstract

Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem can significantly reduce the documentation burden of clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task and experimented with a broad range of approaches, using both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants' solutions.

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: The medical order extraction task takes a doctor-patient dialog and extracts a JSON list of orders containing four keys (description, order_type, reason, and provenance). Orders that were previously prescribed but not explicitly renewed should be excluded (e.g. omeprazole in this example).
  • Figure 2: The ranking of open-weight models obtained with few-shot prompting correlates with parameter count.
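
The output format described in the Figure 1 caption (a JSON list of orders, each with the keys description, order_type, reason, and provenance) can be illustrated with a minimal Python sketch. Only the four key names come from the caption; the function name `validate_orders`, the example values, and the representation of provenance as a list of turn indices are assumptions made for illustration, not the task's official schema or data.

```python
import json

# The four keys named in the Figure 1 caption.
REQUIRED_KEYS = {"description", "order_type", "reason", "provenance"}


def validate_orders(raw: str) -> list[dict]:
    """Parse a JSON string and check it is a list of orders with exactly the four keys.

    Hypothetical helper: it checks structure only, not the field values.
    """
    orders = json.loads(raw)
    if not isinstance(orders, list):
        raise ValueError("expected a JSON list of orders")
    for i, order in enumerate(orders):
        if not isinstance(order, dict):
            raise ValueError(f"order {i} is not a JSON object")
        missing = REQUIRED_KEYS - order.keys()
        extra = order.keys() - REQUIRED_KEYS
        if missing or extra:
            raise ValueError(
                f"order {i}: missing={sorted(missing)}, extra={sorted(extra)}"
            )
    return orders


# Made-up example values (assumptions, not taken from the dataset).
# Provenance is assumed here to be a list of dialogue turn indices.
example = json.dumps([
    {
        "description": "complete blood count",
        "order_type": "lab",
        "reason": "fatigue",
        "provenance": [12, 13],
    }
])

if __name__ == "__main__":
    for order in validate_orders(example):
        print(order["order_type"], "-", order["description"])
```

A structural check of this kind is one plausible way to reject malformed model output before scoring; it does not replace the task's own evaluation of the four fields.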