Vision-Language Models for Automated 3D PET/CT Report Generation

Wenpei Jiao; Kun Shang; Hui Li; Ke Yan; Jiajin Zhang; Guangjie Yang; Lijuan Guo; Yan Wan; Xing Yang; Dakai Jin; Zhaoheng Xie

TL;DR

This work tackles automated 3D PET/CT report generation by introducing PETRG-3D, a dual-stream volumetric framework that jointly encodes PET metabolic activity and CT anatomy. It combines style-aware prompting based on hospital- and gender-specific templates (SAMF) with parameter-efficient LoRA fine-tuning to generate clinically coherent reports. The authors curate PETRG-Lym, a multicenter lymphoma dataset, and AutoPET-RG-Lym, an external benchmark, and propose PETRG-Score, a clinically grounded evaluation that jointly assesses uptake and structural findings. Results show substantial gains in natural language quality and clinical fidelity over existing baselines, while revealing challenges in cross-center CT style generalization and the need for longitudinal and quantitatively precise reporting. The work lays a foundation for disease-aware, multimodal PET/CT report generation; the authors state that code, models, and AutoPET-RG-Lym will be released to support future research.

Abstract

Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and interpretation requires whole-body 3D context rather than a local region. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports with 245,509 paired PET/CT slices), and AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data and equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Code, models, and AutoPET-RG-Lym will be released.
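
The abstract reports a gain in ROUGE-L, a natural language metric based on the longest common subsequence (LCS) between a generated report and its reference. As a rough illustration only, the sketch below computes a sentence-level ROUGE-L F1 score. The function names `lcs_length` and `rouge_l_f1`, the lowercase whitespace tokenization, and the balanced F1 weighting are assumptions made for this example; the paper's actual evaluation toolkit and settings are not stated here and may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    # prev[j] holds the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L F1 (assumed: whitespace tokens, balanced F1)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical report fragments, written only for this illustration.
    reference = "increased uptake in left cervical lymph nodes"
    candidate = "increased uptake in cervical lymph nodes"
    print(round(rouge_l_f1(reference, candidate), 4))
```

Because ROUGE-L rewards word overlap in order, it says little about whether a finding is clinically correct, which is consistent with the abstract pairing it with a separate clinical efficacy protocol (PETRG-Score).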