Automatic Personalized Impression Generation for PET Reports Using Large Language Models

Xin Tie, Muheon Shin, Ali Pirasteh, Nevein Ibrahim, Zachary Huemann, Sharon M. Castellino, Kara M. Kelly, John Garrett, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw

TL;DR

This study tackles the challenge of generating accurate, personalized impressions for whole-body PET reports by fine-tuning a broad set of large language models on a large PET report corpus and conditioning outputs on the reading physician's identity. It identifies the domain-adapted evaluation metrics (BARTScore-PET and PEGASUSScore-PET) that best align with physician judgments, uses them to select PEGASUS as the top model, and validates its clinical utility through expert reader evaluation. The results show that PEGASUS impressions are largely clinically acceptable, with high utility when tailored to individual physicians, and can even support Deauville score predictions with strong accuracy. The work demonstrates the feasibility of integrating personalized impression drafting into PET reporting workflows, potentially speeding up report generation while emphasizing the need for human review to mitigate factual or interpretive errors.

Abstract

In this study, we aimed to determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician's identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's rank correlations (0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). In conclusion, personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
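
The metric-selection step described in the abstract (ranking evaluation metrics by their Spearman correlation with physician quality scores) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names, the metric names `metric_a` and `metric_b`, and the toy scores are hypothetical and are not taken from the study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one physician quality score per generated impression,
# and the score each (made-up) automatic metric gave the same impressions.
physician_scores = [1, 2, 2, 3, 4, 5]
metric_scores = {
    "metric_a": [0.10, 0.30, 0.25, 0.50, 0.70, 0.90],
    "metric_b": [0.90, 0.20, 0.60, 0.10, 0.80, 0.30],
}

# Rank metrics by how well they track the physician's ordering.
ranking = sorted(
    ((spearman_rho(scores, physician_scores), name)
     for name, scores in metric_scores.items()),
    reverse=True,
)
best_rho, best_metric = ranking[0]
print(f"best-aligned metric: {best_metric} (rho = {best_rho:.3f})")
```

Spearman's rho depends only on the ordering of the scores, so it suits ordinal physician ratings and metrics that live on very different numeric scales. In practice a library routine such as `scipy.stats.spearmanr` would replace the hand-written helpers.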

Paper Structure (17 sections, 20 figures)

Figures (20)

  • Figure 1: Formatting of reports for input to encoder-decoder and decoder-only models. For encoder-decoder models, the first two lines describe the examination category and encode the reading physician’s identity, respectively. “Findings” contains the clinical findings from the PET report, and “Indication” includes the patient’s medical history and the reason for the examination. For decoder-only models, each case follows a specific format for the instruction: “Derive the impression from the given [description] for [physician]”. “Input” accommodates the concatenation of clinical findings and indications. The output always starts with the prefix “Response:”. Both model architectures utilize the cross-entropy loss to compute the difference between original clinical impressions and model-generated impressions.
  • Figure 1: All evaluation metrics included in this study and their respective categories.
  • Figure 2: Definitions of six quality dimensions and an overall utility score used in our expert evaluation, along with their corresponding Likert systems.
  • Figure 2: Spearman’s $\rho$ correlations between different evaluation metrics and quality scores assigned by the first physician. The top row quantifies the inter-reader correlation. Notably, domain-adapted BARTScore (BARTScore+PET) and PEGASUSScore (PEGASUSScore+PET) demonstrate the highest correlations with physician preferences.
  • Figure 3: Performance of 12 language models on Deauville score prediction.
  • ...and 15 more figures
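
The input formatting described in the first Figure 1 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the function names, the field labels (`Category:`, `Physician:`, `Instruction:`, `Input:`), the way the exam description is passed in, and the example report text are all hypothetical. Only the overall layout (category and physician lines first, then findings and indication; an instruction of the form "Derive the impression from the given [description] for [physician]"; an output prefixed with "Response:") comes from the caption.

```python
def format_encoder_decoder(category, physician, findings, indication):
    """Build the source text for an encoder-decoder model.

    The first two lines carry the examination category and the reading
    physician's identity; findings and indication follow.
    Field labels are assumed for illustration.
    """
    return "\n".join([
        f"Category: {category}",
        f"Physician: {physician}",
        f"Findings: {findings}",
        f"Indication: {indication}",
    ])


def format_decoder_only(description, physician, findings, indication,
                        impression=None):
    """Build an instruction-style prompt for a decoder-only model.

    If `impression` is given (training), it is appended after the
    "Response:" prefix; otherwise (inference) the prompt ends at the prefix
    and the model is expected to continue from there.
    """
    prompt = "\n".join([
        f"Instruction: Derive the impression from the given {description} "
        f"for {physician}",
        # Findings and indication are concatenated into a single input field.
        f"Input: {findings} {indication}",
        "Response:",
    ])
    if impression is not None:
        prompt += " " + impression
    return prompt


# Hypothetical example values, not drawn from the study's corpus.
example = {
    "findings": "No abnormal FDG uptake in the neck, chest, abdomen or pelvis.",
    "indication": "Lymphoma, end-of-treatment evaluation.",
}

source_text = format_encoder_decoder(
    category="PET CT whole body",
    physician="Physician_A",
    **example,
)
training_text = format_decoder_only(
    description="PET CT whole body report",
    physician="Physician_A",
    impression="No evidence of FDG-avid disease.",
    **example,
)
print(source_text)
print(training_text)
```

Keeping the physician identifier as a plain token in the input is what lets a single fine-tuned model switch reporting style at inference time: the same findings can be paired with a different identifier to request a different physician's style.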