ReXErr: Synthesizing Clinically Meaningful Errors in Diagnostic Radiology Reports
Vishwanatha M. Rao, Serena Zhang, Julian N. Acosta, Subathra Adithan, Pranav Rajpurkar
TL;DR
ReXErr addresses the challenge of evaluating and improving radiology report accuracy by synthesizing clinically plausible errors in chest X-ray reports using Large Language Models. The authors define a 12-category error taxonomy spanning AI- and human-generated mistakes, and implement a data synthesis pipeline with context-aware sampling to generate errorful ground-truth and sentence-level data. They validate the approach via cross-dataset sampling frequencies and clinician assessment, finding high plausibility (83/100). The work provides a scalable resource to train and evaluate error detection and correction algorithms, with potential to improve radiology report reliability.
Abstract
Accurately interpreting medical images and writing radiology reports is a critical but challenging task in healthcare. Both human-written and AI-generated reports can contain errors, ranging from clinical inaccuracies to linguistic mistakes. To address this, we introduce ReXErr, a methodology that leverages Large Language Models to generate representative errors within chest X-ray reports. Working with board-certified radiologists, we developed error categories that capture common mistakes in both human-written and AI-generated reports. Our approach uses a novel sampling scheme to inject diverse errors while maintaining clinical plausibility. ReXErr demonstrates consistency across error categories and produces errors that closely mimic those found in real-world scenarios. This method could aid the development and evaluation of report correction algorithms, potentially enhancing the quality and reliability of radiology reporting.
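The sampling scheme is only summarized here, so the following is a minimal sketch of one way context-aware category sampling could work, not the authors' implementation. The category names, the weights, and the `applicable` rule are illustrative assumptions; the actual taxonomy has 12 categories that this summary does not list.

```python
import random

# Hypothetical categories and weights, for illustration only.
# They are NOT the taxonomy or frequencies used by ReXErr.
CATEGORY_WEIGHTS = {
    "false_finding": 0.5,
    "wrong_severity": 0.3,
    "altered_prior_comparison": 0.2,
}


def applicable(category, report):
    """Assumed context rule: an error about a prior comparison only makes
    sense if the report actually mentions an earlier study."""
    if category == "altered_prior_comparison":
        text = report.lower()
        return "prior" in text or "previous" in text
    return True


def sample_error_categories(report, k, rng):
    """Draw up to k distinct categories, weighted, from those that fit the report."""
    pool = {c: w for c, w in CATEGORY_WEIGHTS.items() if applicable(c, report)}
    chosen = []
    while pool and len(chosen) < k:
        cats = list(pool)
        pick = rng.choices(cats, weights=[pool[c] for c in cats], k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


if __name__ == "__main__":
    rng = random.Random(0)
    report = "Compared with the prior study, the effusion is unchanged."
    print(sample_error_categories(report, 2, rng))
```

In a full pipeline, the sampled categories would presumably be passed to an LLM prompt that rewrites the report accordingly. Filtering by context before sampling is one simple way to avoid injecting errors that could not plausibly occur in a given report.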
