ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports

Vishwanatha M. Rao, Serena Zhang, Julian N. Acosta, Subathra Adithan, Pranav Rajpurkar

TL;DR

ReXErr addresses the challenge of evaluating and improving radiology report accuracy by synthesizing clinically plausible errors in chest X-ray reports using Large Language Models. The authors define a 12-category error taxonomy spanning AI- and human-generated mistakes, and implement a data synthesis pipeline with context-aware sampling to generate error-containing ground-truth and sentence-level data. They validate the approach via cross-dataset sampling frequencies and clinician assessment, finding high plausibility (83/100). The work provides a scalable resource for training and evaluating error detection and correction algorithms, with the potential to improve radiology report reliability.

Abstract

Accurately interpreting medical images and writing radiology reports is a critical but challenging task in healthcare. Both human-written and AI-generated reports can contain errors, ranging from clinical inaccuracies to linguistic mistakes. To address this, we introduce ReXErr, a methodology that leverages Large Language Models to generate representative errors within chest X-ray reports. Working with board-certified radiologists, we developed error categories that capture common mistakes in both human and AI-generated reports. Our approach uses a novel sampling scheme to inject diverse errors while maintaining clinical plausibility. ReXErr demonstrates consistency across error categories and produces errors that closely mimic those found in real-world scenarios. This method has the potential to aid in the development and evaluation of report correction algorithms, potentially enhancing the quality and reliability of radiology reporting.

Paper Structure (17 sections, 4 equations, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Summary of ReXErr error generation pipeline. The bottom panel provides an example of applying ReXErr to a sample radiology report.