Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports

Qingqing Zhu, Xiuying Chen, Qiao Jin, Benjamin Hou, Tejas Sudharshan Mathai, Pritam Mukherjee, Xin Gao, Ronald M Summers, Zhiyong Lu

TL;DR

This work tackles the challenge of evaluating AI-generated radiology reports by integrating professional radiologist expertise with large language models (GPT-3.5 and GPT-4) through In-Context Instruction Learning (ICIL) and Chain-of-Thought (CoT) reasoning. A three-part framework combines sentence-level entailment scoring with Overall Score Regression and Iterative Verification of explanations to produce explainable, radiologist-aligned evaluations. The approach demonstrates superior alignment with expert judgments, evidenced by a Detailed GPT-4 (5-shot) score of $0.48$ (outperforming METEOR by $0.19$) and a Regressed GPT-4 Kendall's Tau of $0.64$, $0.35$ higher than METEOR's best $0.29$. These results support the potential of radiologist-guided, explainable LLM evaluation to improve the accuracy and trustworthiness of AI-generated medical reports, with public release of radiologist annotations to set a new standard for future assessments.

Abstract

In radiology, Artificial Intelligence (AI) has significantly advanced report generation, but automatic evaluation of these AI-produced reports remains challenging. Current metrics, such as Conventional Natural Language Generation (NLG) and Clinical Efficacy (CE), often fall short in capturing the semantic intricacies of clinical contexts or overemphasize clinical details, undermining report clarity. To overcome these issues, our proposed method combines the expertise of professional radiologists with Large Language Models (LLMs) such as GPT-3.5 and GPT-4. Using In-Context Instruction Learning (ICIL) and Chain-of-Thought (CoT) reasoning, our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human- and AI-generated reports. This is further enhanced by a regression model that aggregates sentence evaluation scores. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a score of 0.48, outperforming the METEOR metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment with expert evaluations, exceeding the best existing metric by a margin of 0.35. Moreover, the robustness of our explanations has been validated through a thorough iterative strategy. We plan to publicly release annotations from radiology experts, setting a new standard for accuracy in future assessments. This underscores the potential of our approach in enhancing the quality assessment of AI-driven medical reports.
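
As a rough illustration of how alignment with expert judgments can be measured, the sketch below aggregates per-sentence scores into one score per report and computes Kendall's Tau against radiologist ratings. It is a minimal sketch under stated assumptions: the simple mean stands in for the paper's regression model (whose features are not described in this summary), the score scale and all data values are invented for illustration, and the names `aggregate_report` and `kendall_tau_b` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def aggregate_report(sentence_scores):
    """Collapse per-sentence scores into one report score.

    Assumption: a plain mean is used here as a stand-in for the
    regression model mentioned in the abstract.
    """
    return sum(sentence_scores) / len(sentence_scores)


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (handles ties)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_only_x = tied_only_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            tied_only_x += 1
        elif dy == 0:
            tied_only_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_only_y
    untied_in_y = concordant + discordant + tied_only_x
    denom = sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else 0.0


# Toy data, invented for illustration only (not from the paper).
sentence_scores_per_report = [
    [1.0, 1.0, 0.5],
    [0.0, 0.5],
    [1.0, 0.0, 0.0, 0.5],
    [1.0, 1.0],
]
radiologist_ratings = [4, 1, 2, 5]

metric_scores = [aggregate_report(s) for s in sentence_scores_per_report]
tau = kendall_tau_b(metric_scores, radiologist_ratings)
print(f"Kendall's tau-b: {tau:.2f}")
```

A rank correlation such as Kendall's Tau only asks whether the metric orders reports the same way the radiologists do, so it is insensitive to the scale of the scores being compared.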

Paper Structure (16 sections, 2 equations, 3 figures, 8 tables)

Figures (3)

  • Figure 1: The whole architecture of our evaluation strategy. It is primarily focused on three key areas: In-context Instruction Learning, Overall Score Regression and Iterative Verification. The "sentence score" within the template represents the entailment score, derived by comparing each sentence from the original reports with its corresponding sentence in the prediction. An explanation for this is provided in the lower right corner of the figure.
  • Figure 2: Correlation matrix of Kendall's Tau values for metric pairs. All scores have a p-value < 0.05.
  • Figure 3: Correlation matrix depicting Cohen's Kappa scores for different annotation methods when aggregating sentence scores.
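
For readers unfamiliar with the agreement statistic named in Figure 3, the following is a minimal sketch of Cohen's Kappa for two annotation methods that label the same items. The function name `cohens_kappa` and the toy labels are assumptions made for illustration; the paper's actual label set and aggregation procedure are not given in this summary.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only (not from the paper).
method_a = [1, 1, 0, 0]
method_b = [1, 0, 0, 0]
print(cohens_kappa(method_a, method_b))
```

Unlike raw percent agreement, kappa subtracts the agreement expected by chance, so two methods that both assign one label most of the time do not score highly by default.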