
Vision-Language Models for Automated 3D PET/CT Report Generation

Wenpei Jiao, Kun Shang, Hui Li, Ke Yan, Jiajin Zhang, Guangjie Yang, Lijuan Guo, Yan Wan, Xing Yang, Dakai Jin, Zhaoheng Xie

TL;DR

This work tackles automated 3D PET/CT report generation by introducing PETRG-3D, a dual-stream volumetric framework that jointly encodes PET metabolic activity and CT anatomy. It blends style-aware prompting with hospital- and gender-specific templates (SAMF) and uses parameter-efficient LoRA fine-tuning to generate clinically coherent reports. The authors curate PETRG-Lym, a multicenter lymphoma dataset, and AutoPET-RG-Lym as an external benchmark, along with PETRG-Score for clinically grounded evaluation that jointly assesses uptake and structural findings. Results show substantial gains in natural language quality and clinical fidelity over existing baselines, while revealing challenges in cross-center CT style generalization and the need for longitudinal and quantitatively precise reporting. This work lays a foundation for disease-aware, multimodal PET/CT report generation and provides publicly available datasets and benchmarks to accelerate future research.

Abstract

Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports with 245,509 paired PET/CT slices), and AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Code, models, and AutoPET-RG-Lym will be released.
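
To make the idea of region-level clinical evaluation concrete, the sketch below scores a generated report against a reference by comparing per-region PET (abnormal uptake) and CT (structural abnormality) labels. It is an illustration only, not the PETRG-Score definition: the region list, the boolean label format, and the use of F1 are all assumptions made for this example, and the step that extracts labels from report text is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical region list; the paper's curated regions are not given here.
REGIONS = ("neck", "chest", "abdomen", "pelvis")

# region -> (pet_abnormal, ct_abnormal); an assumed label format.
Findings = Dict[str, Tuple[bool, bool]]


def f1(reference: List[bool], generated: List[bool]) -> float:
    """F1 over binary region labels, treating 'abnormal' as the positive class."""
    pairs = list(zip(reference, generated))
    tp = sum(r and g for r, g in pairs)
    fp = sum((not r) and g for r, g in pairs)
    fn = sum(r and (not g) for r, g in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def region_scores(reference: Findings, generated: Findings) -> Dict[str, float]:
    """Score PET and CT findings separately across the same regions."""
    return {
        "pet_f1": f1([reference[r][0] for r in REGIONS],
                     [generated[r][0] for r in REGIONS]),
        "ct_f1": f1([reference[r][1] for r in REGIONS],
                    [generated[r][1] for r in REGIONS]),
    }


# Toy labels: the generated report misses chest uptake and
# adds a false pelvic uptake, but matches every CT finding.
reference = {"neck": (True, True), "chest": (True, False),
             "abdomen": (False, False), "pelvis": (False, True)}
generated = {"neck": (True, True), "chest": (False, False),
             "abdomen": (False, False), "pelvis": (True, True)}
scores = region_scores(reference, generated)
```

Keeping the PET and CT scores separate, rather than merging them into one number, shows whether errors come from the metabolic or the structural side of the report.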

Paper Structure

This paper contains 47 sections, 5 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: A glimpse of this work. (1) Data: The dataset consists of a private part collected from four medical centers in China and a public part derived from the AutoPET dataset [gatidis2023autopet], for which reports were generated and subsequently revised and verified by two senior nuclear medicine physicians. (2) Method: An end-to-end 3D PET/CT report-generation framework that integrates dual-modality volumetric encoding and hospital-specific prompting to generate clinically coherent reports. (3) Evaluation: A clinical efficacy evaluation pipeline that quantifies both metabolic (PET) and anatomical (CT) accuracy of generated reports.
  • Figure 1: Preprocessing pipeline for PET/CT images and reports.
  • Figure 2: Overall framework of PETRG-3D. The model comprises three key components: (1) a dual-stream 3D visual feature extractor (DSFE) that processes PET and CT volumes separately (blue); (2) a style-adaptive multimodal fusion (SAMF) module that dynamically integrates visual features with hospital-specific prompt templates (yellow); and (3) a LoRA-adapted large language model (LLM) for effective radiology report generation (pink).
  • Figure 2: Illustration of region-level image-report pairs in PET/CT.
  • Figure 3: Qualitative comparison of our method against PET2Rep [zhang2025pet2rep] on chest results. Different colors denote distinct anatomical areas. Incorrect diagnoses are highlighted in gray.
  • ...and 7 more figures
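
The Figure 2 caption mentions a LoRA-adapted LLM. As a rough illustration of what low-rank adaptation does to a single linear layer, the sketch below keeps a weight matrix frozen and adds a trainable rank-r update scaled by alpha / r. This is a generic, dependency-free rendering of the standard LoRA formulation, not the authors' implementation; the class name `LoRALinear`, the matrix sizes, and the hyperparameters are assumptions chosen for the example.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: Matrix, rank: int, alpha: float, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during fine-tuning
        # A starts with small random values, B with zeros, so the adapted
        # layer initially reproduces the frozen layer exactly.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def trainable_parameters(self) -> int:
        return sum(len(row) for row in self.a) + sum(len(row) for row in self.b)


# Hypothetical sizes: a 4x6 frozen weight adapted with rank 2.
weight = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(weight, rank=2, alpha=4.0)
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
initial_output = layer.forward(x)
```

In this toy case the adapter holds 2 × 6 + 4 × 2 = 20 trainable values against 24 frozen ones; the saving becomes large only when the layer dimensions are much bigger than the rank, as in an LLM.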