Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports
Qingqing Zhu, Xiuying Chen, Qiao Jin, Benjamin Hou, Tejas Sudharshan Mathai, Pritam Mukherjee, Xin Gao, Ronald M Summers, Zhiyong Lu
TL;DR
This work tackles the challenge of evaluating AI-generated radiology reports by combining professional radiologist expertise with large language models (GPT-3.5 and GPT-4) through In-Context Instruction Learning (ICIL) and Chain-of-Thought (CoT) reasoning. A three-part framework combines sentence-level entailment scoring with Overall Score Regression and Iterative Verification of explanations to produce explainable, radiologist-aligned evaluations. The approach aligns more closely with expert judgments than existing metrics: Detailed GPT-4 (5-shot) reaches a score of $0.48$ (outperforming METEOR by $0.19$), and Regressed GPT-4 reaches a Kendall's Tau of $0.64$, which is $0.35$ higher than METEOR's best of $0.29$. These results support the potential of radiologist-guided, explainable LLM evaluation to improve the accuracy and trustworthiness of AI-generated medical reports. The authors plan to publicly release the radiologist annotations to set a new standard for future assessments.
Abstract
In radiology, Artificial Intelligence (AI) has significantly advanced report generation, but automatic evaluation of these AI-produced reports remains challenging. Current metrics, such as conventional Natural Language Generation (NLG) metrics and Clinical Efficacy (CE) metrics, often fall short in capturing the semantic intricacies of clinical contexts or overemphasize clinical details, undermining report clarity. To overcome these issues, our proposed method combines the expertise of professional radiologists with Large Language Models (LLMs) such as GPT-3.5 and GPT-4. Using In-Context Instruction Learning (ICIL) and Chain-of-Thought (CoT) reasoning, our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human-written and AI-generated reports. This is further enhanced by a regression model that aggregates sentence-level evaluation scores. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a score of 0.48, outperforming the METEOR metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment with expert evaluations, exceeding the best existing metric by a margin of 0.35. Moreover, the robustness of our explanations has been validated through a thorough iterative strategy. We plan to publicly release annotations from radiology experts, setting a new standard for accuracy in future assessments. This underscores the potential of our approach in enhancing the quality assessment of AI-driven medical reports.
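Alignment with expert evaluations is reported as a rank correlation (the TL;DR cites Kendall's Tau). As a minimal sketch of how such a correlation can be computed between an automatic metric and radiologist scores, the pure-Python function below implements the tau-b variant, which handles ties. The function name `kendall_tau_b` and the toy score lists are hypothetical and are not taken from the paper; the paper may use a different tau variant or a library implementation.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (ties allowed)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither count
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both lists order this pair the same way
        else:
            discordant += 1  # the lists disagree on this pair's order

    # Pairs not tied in x, times pairs not tied in y.
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    denom = sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else float("nan")


# Toy, made-up scores for five generated reports (illustration only).
radiologist_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
metric_scores = [0.8, 0.5, 0.6, 0.1, 0.7]
print(kendall_tau_b(radiologist_scores, metric_scores))
```

A value near 1 means the metric ranks reports in nearly the same order as the radiologists, a value near 0 means the rankings are unrelated, and a value near -1 means they are reversed.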
