Reasoning before Comparison: LLM-Enhanced Semantic Similarity Metrics for Domain Specialized Text Analysis

Shaochen Xu, Zihao Wu, Huaqin Zhao, Peng Shu, Zhengliang Liu, Wenxiong Liao, Sheng Li, Andrea Sikora, Tianming Liu, Xiang Li

TL;DR

The paper addresses the inability of traditional lexical metrics to capture semantic meaning in medical texts. It introduces a GPT-4-driven framework that generates clinical labels for radiology reports; the labels are evaluated against ground-truth CheXpert and NegBio annotations using semantic embeddings. Across 62,500 report-pair comparisons from MIMIC-CXR, similarity scores derived from GPT-4-generated labels align more closely with the ground truth than ROUGE or BLEU scores do, illustrating the value of semantic reasoning in domain-specific text analysis. The work highlights the potential of human-in-the-loop (HITL), domain-focused AI to advance precision medicine and healthcare informatics, while noting limitations such as its restriction to chest radiology and the need for HITL integration in future studies.

Abstract

In this study, we leverage large language models (LLMs) to enhance semantic analysis and develop similarity metrics for texts, addressing the limitations of traditional unsupervised NLP metrics like ROUGE and BLEU. We develop a framework in which LLMs such as GPT-4 are employed for zero-shot text identification and label generation for radiology reports; the labels are then used as measurements of text similarity. By testing the proposed framework on the MIMIC data, we find that GPT-4-generated labels can significantly improve semantic similarity assessment, with scores more closely aligned with clinical ground truth than those of traditional NLP metrics. Our work demonstrates the possibility of conducting semantic analysis of text data using the semi-quantitative reasoning results of LLMs in highly specialized domains. While the framework is implemented for radiology report similarity analysis, its concept can be extended to other specialized domains as well.
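
To make the contrast between lexical overlap and label-based similarity concrete, here is a minimal Python sketch. It is not the paper's implementation: the unigram F1 function is only a rough stand-in for ROUGE-1, the Jaccard overlap of label sets is an assumed way to turn labels into a score, and the label sets are hand-written placeholders for what an LLM might return. The two example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(a: str, b: str) -> float:
    """Lexical similarity: F1 over whitespace-split unigram counts
    (a rough stand-in for ROUGE-1, not the official metric)."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ta & tb).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(tb.values())
    recall = overlap / sum(ta.values())
    return 2 * precision * recall / (precision + recall)


def label_jaccard(labels_a: set, labels_b: set) -> float:
    """Label-based similarity: Jaccard overlap of two label sets
    (an assumed scoring rule, chosen for simplicity)."""
    if not labels_a and not labels_b:
        return 1.0  # two reports with no findings are treated as identical
    return len(labels_a & labels_b) / len(labels_a | labels_b)


# Two invented sentences that describe the same finding in different words.
report_a = "The heart is enlarged."
report_b = "Cardiomegaly is noted."

# Hypothetical labels, standing in for LLM-generated output.
labels_a = {"cardiomegaly"}
labels_b = {"cardiomegaly"}

lexical_score = unigram_f1(report_a, report_b)
label_score = label_jaccard(labels_a, labels_b)
print(f"lexical: {lexical_score:.3f}, label-based: {label_score:.3f}")
```

The two sentences share only the word "is", so the lexical score is low even though both describe the same condition, while the label-based score treats them as equivalent. This is the gap the framework aims to close.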

Paper Structure (17 sections, 3 figures, 1 table)

This paper contains 17 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the prompt and response conversation between the user and GPT-4 to generate a list of custom labels for any given medical text. (1) The left side shows the entire conversational flow between the user and GPT-4 to generate the desired labels. (2) The right side displays the outputs of that conversation from GPT-4's response.
  • Figure 2: Computational pipeline of the similarity difference calculation between two radiology reports from the MIMIC-CXR dataset used in our experimental settings.
  • Figure 3: Comparison of predicted similarities using various methods across the 62,500 text pairs. The dashed line denotes the ideal prediction-ground truth (GT) match across the 5th to 95th percentile range of GT similarities. A closer alignment to the dashed line signifies a greater correspondence with the GT. For clarity, only hexagonal bins with more than 100 observations are displayed. The results demonstrate that GPT_sim exhibits the highest degree of alignment with the GT.
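
The idea behind Figure 3, that a method is better the closer its predicted similarities lie to the ideal prediction-GT line, can be sketched as a simple deviation score. The snippet below is an illustrative assumption rather than the paper's evaluation code: the function name, the choice of mean absolute deviation, and all numbers are made up for demonstration.

```python
def mean_abs_deviation(predicted, ground_truth):
    """Average vertical distance of (GT, prediction) points from the
    identity line y = x; lower means closer alignment with the GT."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(predicted)


# Toy values only; these are not results from the paper.
gt_sim = [0.2, 0.5, 0.8]
method_a_sim = [0.25, 0.45, 0.75]  # tracks the GT closely
method_b_sim = [0.6, 0.6, 0.6]     # nearly constant, ignores the GT

dev_a = mean_abs_deviation(method_a_sim, gt_sim)
dev_b = mean_abs_deviation(method_b_sim, gt_sim)
print(f"method A: {dev_a:.3f}, method B: {dev_b:.3f}")
```

Under this toy score, the method whose predictions follow the GT gets the smaller deviation, which corresponds to points hugging the dashed line in the figure.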