Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports

Qingqing Zhu; Xiuying Chen; Qiao Jin; Benjamin Hou; Tejas Sudharshan Mathai; Pritam Mukherjee; Xin Gao; Ronald M Summers; Zhiyong Lu

TL;DR

This work tackles the challenge of evaluating AI-generated radiology reports by integrating professional radiologist expertise with large language models (GPT-3.5 and GPT-4) through In-Context Instruction Learning (ICIL) and Chain-of-Thought (CoT) reasoning. A three-part framework combines sentence-level entailment scoring with Overall Score Regression and Iterative Verification of explanations to produce explainable, radiologist-aligned evaluations. The approach demonstrates superior alignment with expert judgments, evidenced by a Detailed GPT-4 (5-shot) score of $0.48$ (outperforming METEOR by $0.19$) and a Regressed GPT-4 Kendall's Tau of $0.64$, $0.35$ higher than METEOR's best $0.29$. These results support the potential of radiologist-guided, explainable LLM evaluation to improve the accuracy and trustworthiness of AI-generated medical reports, with public release of radiologist annotations to set a new standard for future assessments.

Abstract

In radiology, Artificial Intelligence (AI) has significantly advanced report generation, but automatic evaluation of these AI-produced reports remains challenging. Current metrics, such as conventional Natural Language Generation (NLG) and Clinical Efficacy (CE) metrics, often fall short in capturing the semantic intricacies of clinical contexts or overemphasize clinical details, undermining report clarity. To overcome these issues, our proposed method combines the expertise of professional radiologists with Large Language Models (LLMs) such as GPT-3.5 and GPT-4. Using In-Context Instruction Learning (ICIL) and Chain of Thought (CoT) reasoning, our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI-generated reports. This is further enhanced by a regression model that aggregates sentence evaluation scores. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a score of 0.48, outperforming the METEOR metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment with expert evaluations, exceeding the best existing metric by a margin of 0.35. Moreover, the robustness of our explanations has been validated through a thorough iterative strategy. We plan to publicly release annotations from radiology experts, setting a new standard for accuracy in future assessments. This underscores the potential of our approach in enhancing the quality assessment of AI-driven medical reports.
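The TL;DR names Kendall's Tau as the measure of alignment between a metric and expert judgments. As a rough illustration of what that number measures, the following is a minimal sketch of the tau-b variant in plain Python. The function name `kendall_tau_b` and the toy scores are assumptions made for this example; they are not the paper's code or data, and the paper may use a different variant or library.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    # Compare every pair of items once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both lists order the pair the same way
        else:
            discordant += 1  # the lists disagree on the order
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Toy data (assumed): expert ratings vs. an automatic metric for five reports.
expert_scores = [1, 2, 3, 4, 5]
metric_scores = [0.1, 0.3, 0.2, 0.8, 0.9]
tau = kendall_tau_b(expert_scores, metric_scores)
```

In this toy case nine of the ten report pairs are ordered the same way by both lists and one is reversed, so the coefficient is (9 - 1) / 10 = 0.8. A value near 1 means the metric ranks reports almost as the experts do, and a value near 0 means little rank agreement.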

Paper Structure (16 sections, 2 equations, 3 figures, 8 tables)

Introduction
Related Work
Evaluation Metrics in Radiology Reports
LLMs for Evaluation
Method
In-context Instruction Learning
Overall Score Regression
Iterative Verification of the Explanatory Mechanisms
Experiments
Results and Analyses
Discussion
Conclusion
Limitations
Ethical Statement
Acknowledgements
...and 1 more sections

Figures (3)

Figure 1: The whole architecture of our evaluation strategy. It is primarily focused on three key areas: In-context Instruction Learning, Overall Score Regression and Iterative Verification. The "sentence score" within the template represents the entailment score, derived by comparing each sentence from the original reports with its corresponding sentence in the prediction. An explanation for this is provided in the lower right corner of the figure.
Figure 2: Correlation matrix of Kendall's Tau values for metric pairs. All scores have p-values < 0.05.
Figure 3: Correlation matrix depicting Cohen's Kappa scores for different annotation methods when aggregating sentence scores.
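
Figure 3 reports Cohen's Kappa between annotation methods. For readers unfamiliar with the statistic, here is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), in plain Python. The function name `cohen_kappa` and the toy labels are assumptions for illustration only; how the paper aggregates sentence scores into labels before computing agreement is not specified here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items (hypothetical helper)."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both annotators use one identical label
    return (observed - expected) / (1 - expected)


# Toy labels (assumed): 1 = sentence judged entailed, 0 = not entailed.
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_a, annotator_b)
```

Here the annotators agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, giving a kappa of 0.5. Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the 75% raw figure.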
