Semantic Textual Similarity Assessment in Chest X-ray Reports Using a Domain-Specific Cosine-Based Metric
Sayeh Gholipour Picha, Dawood Al Chanti, Alice Caplier
TL;DR
This work introduces Medical Corpus Similarity Evaluation (MCSE), a domain-specific metric for assessing semantic similarity between generated chest X-ray reports and ground-truth references. It combines Clinical Entity Extraction with a Domain Similarity Evaluation that accounts for negations and descriptive modifiers and computes a domain-aware, cosine-like score from neural representations. The method is validated against radiology-annotated datasets and public chest X-ray corpora, and is then applied to recent report-generation models, where it gives more meaningful semantic judgments than traditional metrics such as BLEU. The study also notes a bias in the domain cosine similarity and suggests ways to mitigate it in future work. Code and resources are publicly available.
Abstract
Medical language processing and deep learning techniques have emerged as critical tools for improving healthcare, particularly in the analysis of medical imaging and medical text data. Fusing these two modalities helps to improve the interpretation of medical images and leads to increased diagnostic accuracy, better-informed clinical decisions, and improved patient outcomes. The success of such models relies on the ability to extract and consolidate semantic information from clinical text. This paper addresses the need for more robust methods to evaluate the semantic content of medical reports. Conventional natural language processing metrics were originally designed for general-domain text and machine translation, and they often fail to capture the complex semantics of medical content. In this study, we introduce a novel approach designed specifically for assessing the semantic similarity between generated medical reports and the ground truth. We validate the approach and show that it is effective at assessing domain-specific semantic similarity within medical contexts. Applied to state-of-the-art chest X-ray report generation models, our metric yields results that not only align with conventional metrics but also provide more contextually meaningful scores in the medical domain.
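
To make the idea of a negation-aware, cosine-based comparison between clinical entities more concrete, the following is a minimal illustrative sketch. It is not the MCSE implementation: the `Entity` type, the toy two-dimensional vectors, the rule that a negation mismatch scores zero, and the best-match averaging in `report_similarity` are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical container for one extracted clinical entity."""
    term: str
    negated: bool                 # True for findings such as "no pneumothorax"
    vector: tuple[float, ...]     # toy embedding; a real system would use a neural encoder


def cosine(u, v):
    """Plain cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def entity_similarity(a, b):
    """Assumed rule: entities with different negation status do not match."""
    if a.negated != b.negated:
        return 0.0
    return cosine(a.vector, b.vector)


def report_similarity(generated, reference):
    """Assumed aggregation: average best match for each reference entity."""
    if not generated or not reference:
        return 0.0
    best = [max(entity_similarity(r, g) for g in generated) for r in reference]
    return sum(best) / len(best)


# Toy reference report: "pleural effusion, no pneumothorax"
reference = [
    Entity("effusion", False, (1.0, 0.0)),
    Entity("pneumothorax", True, (0.0, 1.0)),
]
# Generated report that agrees with the reference
agreeing = list(reference)
# Generated report that wrongly asserts a pneumothorax
contradicting = [
    Entity("effusion", False, (1.0, 0.0)),
    Entity("pneumothorax", False, (0.0, 1.0)),
]

score_agreeing = report_similarity(agreeing, reference)
score_contradicting = report_similarity(contradicting, reference)
print(score_agreeing, score_contradicting)
```

In this toy setup the contradicting report shares every word-level term with the reference, yet it scores lower because the negation status of one finding differs. An n-gram overlap metric has no explicit notion of negation and would treat the two reports as nearly identical.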
