Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation
Zhi Qin, Qianhui Gui, Mouxiao Bian, Rui Wang, Hong Ge, Dandan Yao, Ziying Sun, Yuan Zhao, Yu Zhang, Hui Shi, Dongdong Wang, Chenxin Song, Shenghong Ju, Lihao Liu, Junjun He, Jie Xu, Yuan-Cheng Wang
TL;DR
This work introduces a standardized multimodal framework for medical imaging quality control (QC) and systematically evaluates a broad set of large language models (LLMs) on chest X-ray (CXR) image quality assessment and CT report auditing within a Medbench-driven pipeline. The study assembles 161 CXR images and 219 CT reports with rigorous de-identification and radiologist oversight, defines 11 CXR QC criteria and 8 CT report error types, and scores models objectively with Macro-F1 and Micro-F1. Gemini 2.0-Flash generalizes best on CXR QC (Macro-F1 of 90), while DeepSeek-R1 leads CT report QC (recall of 62.23%); across models, higher precision trades off against the rate of additional error discovery. The work demonstrates the feasibility and value of standardized QC benchmarks for AI-assisted radiology and outlines practical directions for improving model robustness, multi-language support, and cross-institution validation, with the aim of influencing clinical workflows.
Abstract
Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, we establish a standardized dataset and evaluation framework for medical imaging QC and systematically assess large language models (LLMs) on image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports. We then evaluated multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, on their ability to detect technical errors and inconsistencies, measured by recall, precision, and F1 score. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a recall of 62.23%, outperforming the other models. However, its distilled variants performed poorly, while InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.
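
For readers unfamiliar with the two aggregate scores, the following is a minimal sketch of how Macro-F1 and Micro-F1 are conventionally computed for a multi-label task such as flagging several QC criteria per image. It is not the authors' evaluation code: the function names and the label names ("artifact", "positioning") are hypothetical placeholders, and the sketch returns scores on a 0 to 1 scale, whereas the scores above appear to be reported on a 0 to 100 scale.

```python
from typing import Dict, List, Set, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Tuple[float, float]:
    """Return (macro_f1, micro_f1) for multi-label predictions.

    gold[i] and pred[i] are the sets of labels assigned to sample i by the
    reference annotation and by the model, respectively.
    """
    # Per-label counts stored as [true positives, false positives, false negatives].
    counts: Dict[str, List[int]] = {label: [0, 0, 0] for label in labels}
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1

    # Macro-F1: average the per-label F1 scores, so every label weighs equally.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(labels)

    # Micro-F1: pool the counts over all labels, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)
    return macro, micro


# Hypothetical toy data: three images, two placeholder QC labels.
labels = ["artifact", "positioning"]
gold = [{"artifact"}, {"artifact", "positioning"}, set()]
pred = [{"artifact"}, {"artifact"}, {"positioning"}]
macro, micro = macro_micro_f1(gold, pred, labels)
```

Because Macro-F1 weights every label equally, a model that does well on frequent criteria but poorly on rare ones can show a noticeably lower Macro-F1 than Micro-F1, which is one reason to report both.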
