Quality Control for Radiology Report Generation Models via Auxiliary Auditing Components

Hermione Warr; Yasin Ibrahim; Daniel R. McGowan; Konstantinos Kamnitsas

TL;DR

The paper tackles semantic inaccuracies in AI-generated radiology reports by introducing a modular auditing framework built around auxiliary auditing components (AC) that elicit disease signals from both images and text. The GenX report generator produces chest X-ray reports, which are audited against image-derived disease labels $C_I$ and text-derived labels $C_T$ under the consistency rule $(C_I=C_T) \land p_{AC}(c=C_I|I) \ge t$, with optional deferral for low confidence. Experiments on MIMIC-CXR show that auditing improves disease-semantic F1 from baseline GenX levels to as high as $\approx$58.4 with $t=0.8$, with per-disease ACs outperforming a single multi-label AC, confirming the value of modular redundancy for reliability. The findings support a practical quality-control pathway for clinical deployment of radiology report generation and suggest generalization to other semantic concepts beyond disease classification.
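
The consistency rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `AuditOutcome` and `audit_label` are hypothetical, the default threshold simply reuses the $t=0.8$ value quoted above, and checking confidence before agreement (so that a low-confidence disagreement is deferred rather than flagged) is an assumption, since the summary does not specify the order.

```python
from enum import Enum


class AuditOutcome(Enum):
    """Possible results of auditing one disease label (hypothetical names)."""
    PASS = "pass"    # C_I == C_T and the AC is confident: report deemed likely reliable
    FLAG = "flag"    # C_I != C_T: report may contain an error
    DEFER = "defer"  # AC confidence below t: leave the decision to the user


def audit_label(c_image: int, c_text: int, p_ac: float, t: float = 0.8) -> AuditOutcome:
    """Audit a single disease label.

    c_image: label predicted by the auxiliary classifier from the image (C_I)
    c_text:  label extracted from the generated report (C_T)
    p_ac:    the classifier's probability for its own prediction, p_AC(c = C_I | I)
    t:       confidence threshold
    """
    # Assumption: low confidence takes priority over disagreement.
    if p_ac < t:
        return AuditOutcome.DEFER
    # A report passes only when (C_I == C_T) and p_AC(c = C_I | I) >= t.
    if c_image == c_text:
        return AuditOutcome.PASS
    return AuditOutcome.FLAG


if __name__ == "__main__":
    print(audit_label(c_image=1, c_text=1, p_ac=0.93))  # labels agree, confident
    print(audit_label(c_image=1, c_text=0, p_ac=0.93))  # labels disagree, confident
    print(audit_label(c_image=1, c_text=1, p_ac=0.55))  # classifier unsure
```

With per-disease ACs, a function like this would be called once per disease; how the per-disease outcomes are combined into a report-level decision is not stated here and is left open.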

Abstract

Automation of medical image interpretation could alleviate bottlenecks in diagnostic workflows, and has become of particular interest in recent years due to advancements in natural language processing. Great strides have been made towards automated radiology report generation via AI, yet ensuring clinical accuracy in generated reports is a significant challenge, hindering deployment of such methods in clinical practice. In this work we propose a quality control framework for assessing the reliability of AI-generated radiology reports with respect to semantics of diagnostic importance using modular auxiliary auditing components (AC). Evaluating our pipeline on the MIMIC-CXR dataset, our findings show that incorporating ACs in the form of disease-classifiers can enable auditing that identifies more reliable reports, resulting in higher F1 scores compared to unfiltered generated reports. Additionally, leveraging the confidence of the AC labels further improves the audit's effectiveness.

Paper Structure (10 sections, 2 equations, 2 figures, 3 tables)

Introduction
Methodology
Report Generation Model - GenX
Auditing Generated Reports via Auxiliary Classifiers
Experiments and Results
Data
Evaluating the Report Generation Model
Evaluating the Auditing Framework for Generated Reports
Conclusion
Acknowledgements

Figures (2)

Figure 1: Proposed error detection pipeline for radiology report generation using auxiliary auditing components. The standard report generation pipeline ($a$) is followed by the CheXbert labeler, $g_T$ ($b$), which extracts from the generated report the pathology labels, $C_T$, that are semantically meaningful for clinical diagnosis. ($c$) Modular image-based audit models (AC), here disease classifiers, predict disease class labels, $C_I$, from the image, $I$. ($d$) If the labels predicted from the image ($C_I$) and from the report ($C_T$) are consistent, the report's contents are deemed likely reliable. If they are inconsistent, $C_I \neq C_T$, the report is flagged as less reliable and potentially containing an error. If the image-based classifiers have predictive confidence below threshold $t$, auditing can be deferred to the user due to uncertainty.
Figure 2: Illustration of one inference step for radiology report generation ($a$). Embeddings of the image and of the previously generated text tokens are passed to the autoregressive language model ($b$), which outputs a probability distribution over the vocabulary to predict the next word in the sequence.
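
The inference step described in the Figure 2 caption can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_language_model` is a stand-in that sums embedding columns to produce logits, the image and token embeddings are simply concatenated into one sequence, and the next word is chosen greedily. The summary does not state how GenX combines its inputs or which decoding strategy it uses.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedding = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_language_model(sequence: Sequence[Embedding]) -> List[float]:
    """Hypothetical stand-in for the language model: one logit per vocabulary
    entry, obtained by summing each embedding dimension over the sequence."""
    return [sum(column) for column in zip(*sequence)]


def next_token_step(
    image_embeddings: List[Embedding],
    token_embeddings: List[Embedding],
    language_model: Callable[[Sequence[Embedding]], List[float]],
    vocab: Sequence[str],
) -> Tuple[str, List[float]]:
    """Run one inference step and return the chosen word and the distribution."""
    # Assumption: image and text embeddings form a single input sequence.
    sequence = image_embeddings + token_embeddings
    probs = softmax(language_model(sequence))
    # Assumption: greedy decoding, i.e. pick the most probable vocabulary entry.
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs


if __name__ == "__main__":
    vocab = ["no", "pleural", "effusion"]
    image_embeddings = [[0.1, 0.2, 0.0]]   # toy values, not real features
    token_embeddings = [[0.0, 0.0, 1.5]]   # toy embedding of the text so far
    word, probs = next_token_step(
        image_embeddings, token_embeddings, toy_language_model, vocab
    )
    print(word, [round(p, 3) for p in probs])
```

In a full generation loop, the chosen word would be embedded and appended to `token_embeddings` before the next step, and this would repeat until an end-of-sequence token is produced.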
