PET-F2I: A Comprehensive Benchmark and Parameter-Efficient Fine-Tuning of LLMs for PET/CT Report Impression Generation

Yuchen Liu, Wenbo Zhang, Liling Peng, Yichi Zhang, Yu Fu, Xin Guo, Chao Qu, Yuan Qi, Le Xue

TL;DR

This work introduces PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports, and establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.

Abstract

PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.

Paper Structure (10 sections, 3 equations, 6 figures, 1 table)

This paper contains 10 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of the PET-F2I-41K framework. The pipeline encompasses clinical task formalization, comprehensive benchmarking of 27 LLMs, and multi-dimensional evaluation, demonstrating the absolute superiority of our domain-adapted PET-F2I-7B (+138% ECR, -75% UER).
  • Figure 2: (a) Tracer distribution in PET-F2I-41K (41,191 reports). (b, c) 27 models ranked by BLEU-4 and ECR; PET-F2I-7B ($\star$) consistently outperforms frontier proprietary, large-scale open-source, and specialized medical LLMs.
  • Figure 3: Score distributions across model categories ($N=500$). Beyond superior averages, PET-F2I-7B demonstrates exceptional clinical stability. Crucially, PET-F2I-7B's lowest quartile for exact entity coverage (ECR) strictly dominates the highest quartiles of all zero-shot baselines, underscoring its superior reliability for real-world deployment.
  • Figure 4: Correlation between NLG and clinical metrics. Despite strong macro-level correlations (e.g., ECR vs. BERTScore-F1: $r=0.889$), substantial intra-distribution variance reveals that high lexical overlap does not guarantee clinical accuracy. Models with identical NLG scores frequently exhibit drastically different entity omission rates, proving traditional metrics are insufficient proxies for diagnostic safety and highlighting the necessity of the PET-F2I framework.
  • Figure 5: Independence of clinical metrics and divergence from NLG scores. (a) Sample-level analysis demonstrates that entity coverage (ECR) and formatting compliance (FCR) are fundamentally orthogonal ($r=0.075$). (b) The correlation matrix reveals that standard lexical metrics correlate only weakly with clinical factuality (e.g., BLEU-4 vs. FCR: $r=0.28$), highlighting their inadequacy for safety evaluation.
  • ...and 1 more figure
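
The correlation coefficients quoted in the captions for Figures 4 and 5 (e.g., $r=0.889$, $r=0.075$) can be illustrated with a sample-level correlation between two metric columns. The following is a minimal sketch, assuming the reported $r$ is Pearson's correlation coefficient (the captions do not name the statistic). The function name `pearson_r` and the score lists are hypothetical and chosen for illustration only; they are not values from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-sample scores (illustrative values only, not from the paper).
ecr_scores = [0.20, 0.35, 0.50, 0.65, 0.90]
bertscore_f1 = [0.61, 0.66, 0.70, 0.78, 0.88]

r = pearson_r(ecr_scores, bertscore_f1)
print(f"ECR vs. BERTScore-F1: r = {r:.3f}")
```

A high aggregate $r$ of this kind summarizes only the overall trend. As the Figure 4 caption notes, individual samples or models with similar NLG scores can still differ widely in entity coverage.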