MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes

Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, Thomas Lin

TL;DR

MEDEC is the first publicly available benchmark for medical error detection and correction in clinical notes, addressing the need for automated validation of medical text written by or with the help of LLMs. It combines two data-creation methods to assemble 3,848 clinical texts covering five error types (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism) and provides training, validation, and test splits spanning both the MS and UW collections. The study evaluates a range of recent LLMs and two medical doctors on three subtasks with multiple metrics, and finds that although the models perform well at error detection and correction, the doctors still outperform them. The work highlights the importance of robust evaluation metrics and data diversity for the safe deployment of AI-assisted clinical documentation, and outlines directions for future work.

Abstract

Several studies have shown that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score on some medical exams. However, to our knowledge, no study has assessed the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (https://github.com/abachaa/MEDEC), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM. The dataset has been used for the MEDIQA-CORR shared task to evaluate seventeen participating systems [Ben Abacha et al., 2024]. We describe the data creation methods and evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on the tasks of detecting and correcting medical errors, which require both medical knowledge and reasoning capabilities. We also conducted a comparative study in which two medical doctors performed the same tasks on the MEDEC test set. The results show that MEDEC is a sufficiently challenging benchmark for assessing the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs perform well in error detection and correction, they are still outperformed by medical doctors on these tasks. We discuss the potential factors behind this gap, the insights from our experiments, and the limitations of current evaluation metrics, and we share pointers for future research.
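
As a rough illustration of the detection side of the tasks described above, the sketch below scores a system on two things: whether it correctly flags a note as containing an error, and whether it points at the right sentence. The NoteLabel type, its field names, and the use of plain accuracy are assumptions made for this example; they are not the MEDEC data format or the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NoteLabel:
    """Hypothetical label for one clinical text (not the MEDEC schema)."""
    has_error: bool
    # Index of the erroneous sentence; None when the note is labeled correct.
    error_sentence_id: Optional[int] = None


def _check(gold: Sequence[NoteLabel], pred: Sequence[NoteLabel]) -> None:
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one note is required")


def flag_accuracy(gold: Sequence[NoteLabel], pred: Sequence[NoteLabel]) -> float:
    """Fraction of notes whose error / no-error flag matches the reference."""
    _check(gold, pred)
    hits = sum(g.has_error == p.has_error for g, p in zip(gold, pred))
    return hits / len(gold)


def sentence_accuracy(gold: Sequence[NoteLabel], pred: Sequence[NoteLabel]) -> float:
    """Fraction of notes whose predicted error sentence matches the reference.

    A correct note counts as a hit only if the system also predicts None.
    """
    _check(gold, pred)
    hits = sum(g.error_sentence_id == p.error_sentence_id for g, p in zip(gold, pred))
    return hits / len(gold)


# Toy data, invented for illustration only.
gold = [NoteLabel(True, 3), NoteLabel(False), NoteLabel(True, 0)]
pred = [NoteLabel(True, 3), NoteLabel(True, 1), NoteLabel(True, 2)]

print(f"flag accuracy:     {flag_accuracy(gold, pred):.3f}")
print(f"sentence accuracy: {sentence_accuracy(gold, pred):.3f}")
```

Scoring generated corrections is harder than either of these, since a correction can be right without matching the reference wording; that is outside the scope of this sketch.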

Paper Structure

This paper contains 13 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Examples from the MEDEC dataset.
  • Figure 2: Method #1: Correct answer injected in the question text to create the reference note. The same process was used to inject a selected incorrect answer and to create another version of the note containing a medical error.
  • Figure 3: Error Type Distribution in the MEDEC Dataset.