Semantic Similarity in Radiology Reports via LLMs and NER

Beth Pearson, Ahmed Adnan, Zahraa S. Abdallah

TL;DR

This paper tackles the challenge of semantically comparing preliminary and final radiology reports to support junior radiologists in training and quality assurance. It introduces Llama-EntScore, a hybrid method that merges NER-grounded entity extraction with a context-aware LLM assessment, yielding a numeric similarity score and an interpretable explanation. The approach uses a tunable Entity-Based Semantic Agreement Score (ESAS) that categorises entities as matched, mismatched, missing, or surplus, and anchors LLM explanations to the numerical score. On a dataset of 115 anonymised report pairs, Llama-EntScore achieves 67% exact-match accuracy and 93% accuracy within ±1 of radiologist ground-truth scores, outperforming both standalone LLMs and pure NER approaches. This makes scalable, locally run, explainable feedback on radiology reporting more practical. The work highlights the potential and limitations of hybrid AI systems in clinical NLP, and points to future directions in expanding categories, refining NER grounding, and reducing computation to enable broader clinical deployment.
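
The four entity categories and the idea of tunable weights can be illustrated with a minimal sketch. It is not the paper's ESAS formula, which this summary does not reproduce. The category definitions (matched = same term and same status in both reports, mismatched = same term with a different status, missing = only in the final report, surplus = only in the preliminary report), the status labels, the function names, and the normalised weighted ratio are all assumptions made for illustration.

```python
# Hypothetical sketch: NOT the paper's ESAS definition.
# Each report is assumed to be reduced to {entity term: status} by an NER step.

def categorise(prelim: dict[str, str], final: dict[str, str]) -> dict[str, int]:
    """Count entities per category, treating the final report as the reference."""
    counts = {"matched": 0, "mismatched": 0, "missing": 0, "surplus": 0}
    for term, status in final.items():
        if term not in prelim:
            counts["missing"] += 1      # in the final report only
        elif prelim[term] == status:
            counts["matched"] += 1      # same term, same status
        else:
            counts["mismatched"] += 1   # same term, different status (e.g. negation)
    # Terms that appear in the preliminary report only.
    counts["surplus"] = sum(1 for term in prelim if term not in final)
    return counts


def weighted_agreement(counts: dict[str, int], weights: dict[str, float]) -> float:
    """Assumed scoring rule: weighted matched mass over total weighted mass."""
    total = sum(weights[c] * n for c, n in counts.items())
    if total == 0:
        return 1.0  # no weighted entities on either side: treat as full agreement
    return weights["matched"] * counts["matched"] / total


# Illustrative entities, not taken from the paper's dataset.
prelim = {"pneumothorax": "absent", "effusion": "present", "nodule": "present"}
final = {"pneumothorax": "present", "effusion": "present", "consolidation": "present"}

counts = categorise(prelim, final)
equal_weights = {"matched": 1.0, "mismatched": 1.0, "missing": 1.0, "surplus": 1.0}
strict_weights = {"matched": 1.0, "mismatched": 2.0, "missing": 1.0, "surplus": 1.0}

score_equal = weighted_agreement(counts, equal_weights)    # 1 / 4 = 0.25
score_strict = weighted_agreement(counts, strict_weights)  # 1 / 5 = 0.2
print(counts, score_equal, score_strict)
```

Raising the weight on mismatched entities lowers the score for the same pair. This is one way a tunable score could emphasise clinically significant changes such as a reversed negation.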

Abstract

Radiology report evaluation is a crucial part of radiologists' training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by evaluating how well several LLMs compare radiology reports. We then assess a more traditional approach based on named-entity recognition (NER). Both approaches, however, exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method that combines Llama 3.1 and NER, with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress, together with an interpretation of the score that aims to offer junior radiologists valuable guidance in reviewing and refining their reporting. Our method achieves 67% exact-match accuracy and 93% accuracy within ±1 when compared to radiologist-provided ground-truth scores, outperforming both LLMs and NER used independently. Code is available at: https://github.com/otmive/llama_reports
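
The two reported metrics, exact-match accuracy and accuracy within ±1, can be computed as in the sketch below. It assumes that predicted and ground-truth scores sit on the same integer scale. The function name `score_accuracy` and the example scores are illustrative and are not taken from the paper or its repository.

```python
# Minimal sketch of exact-match and within-tolerance accuracy.
# Assumes integer scores on a shared scale; the example values are made up.

def score_accuracy(predicted: list[int], truth: list[int], tolerance: int = 0) -> float:
    """Fraction of predictions within `tolerance` of the ground-truth score."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("at least one score pair is required")
    hits = sum(abs(p - t) <= tolerance for p, t in zip(predicted, truth))
    return hits / len(truth)


predicted = [3, 4, 2, 5, 1]
truth = [3, 5, 2, 2, 1]

exact = score_accuracy(predicted, truth)                     # 3 of 5 agree exactly
within_one = score_accuracy(predicted, truth, tolerance=1)   # 4 of 5 within ±1
print(f"exact-match: {exact:.2f}, within ±1: {within_one:.2f}")
```

Reporting both figures separates near misses from larger disagreements. A prediction one step away from the radiologist's score counts against exact-match accuracy but not against the ±1 figure.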

Paper Structure

This paper contains 12 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Examples of small negation changes in report impression sections.
  • Figure 2: Distribution of similarity scores for all methods. Llama-EntScore most closely follows the shape and spread of expert-annotated ground truth scores.
  • Figure 3: Confusion matrices for predicted vs. ground truth scores. Llama-EntScore produces a tighter diagonal pattern, indicating more precise estimates.
  • Figure 4: Entity comparison visualisation. Colour codes: green = matched, yellow = mismatched, red = missing, blue = surplus.
  • Figure 5: Example LLM-generated explanation for a similarity score of 0.63 between a preliminary and final radiology report. The output highlights matched findings, similar conclusions, and minor wording differences.