PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography

Yichi Zhang, Wenbo Zhang, Zehui Ling, Gang Feng, Sisi Peng, Deshu Chen, Yuchen Liu, Hongwei Zhang, Shuqi Wang, Lanlan Li, Limei Han, Yuan Cheng, Zixin Hu, Yuan Qi, Le Xue

TL;DR

PET2Rep defines the first PET/CT-focused radiology report generation benchmark, revealing that current vision-language models struggle to produce clinically usable, structured, whole-body reports. By integrating a large-scale, real-clinical-scenario dataset with a standardized prompting framework and a dual set of evaluation metrics (NLG and PET Clinical Efficacy), the work demonstrates a substantial gap between state-of-the-art VLM performance and practical clinical requirements. The benchmark highlights the need for domain-specific data curation, 3D spatial reasoning, and clinically grounded evaluation metrics to advance automated molecular-imaging reporting. Collectively, PET2Rep lays the groundwork for targeted model development and rigorous assessment that align with real-world radiology practice and decision-making.

Abstract

Positron emission tomography (PET) is a cornerstone of modern oncologic and neurologic imaging, distinguished by its unique ability to illuminate dynamic metabolic processes that transcend the anatomical focus of traditional imaging technologies. Radiology reports are essential for clinical decision making, yet their manual creation is labor-intensive and time-consuming. Recent advances in vision-language models (VLMs) have shown strong potential in medical applications, presenting a promising avenue for automating report generation. However, existing applications of VLMs in the medical domain have predominantly focused on structural imaging modalities, while the unique characteristics of molecular PET imaging have largely been overlooked. To bridge this gap, we introduce PET2Rep, a large-scale, comprehensive benchmark for evaluating general and medical VLMs on radiology report generation from PET images. PET2Rep stands out as the first dedicated dataset for PET report generation with metabolic information, uniquely capturing whole-body image-report pairs that cover dozens of organs to fill the critical gap in existing benchmarks and mirror real-world clinical comprehensiveness. In addition to widely recognized natural language generation metrics, we introduce a series of clinical efficacy metrics to evaluate the quality of radiotracer uptake pattern descriptions for key organs in generated reports. We conduct a head-to-head comparison of 30 cutting-edge general-purpose and medical-specialized VLMs. The results show that current state-of-the-art VLMs perform poorly on the PET report generation task, falling considerably short of fulfilling practical needs. Moreover, we identify several key insufficiencies that need to be addressed to advance development in medical applications.

Paper Structure

This paper contains 34 sections, 4 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: An overview of the PET2Rep benchmark. Each case contains whole-body PET/CT images with a radiology report.
  • Figure 2: Pipeline of the PET2Rep benchmark for evaluating VLM-based PET radiology report generation. First, VLMs analyze the PET/CT images under a designed prompt format that supplies the necessary information: image modality, clinical task, and a report template based on radiologist training guidelines. The generated reports are then evaluated against the ground-truth reports with widely recognized natural language generation (NLG) metrics and a novel clinical efficacy (CE) metric for PET imaging. Radiologists further score the reports manually for a more comprehensive evaluation.
  • Figure 3: Performance comparison of three VLMs under different task settings in manual evaluation. Two radiologists rated each model across five dimensions: Medical Accuracy (MedAcc), Key Findings Completeness (FinCom), Expression Clarity (ExpCla), Clinical Usability (CliUsa), and Overall Rating (OveRat).
  • Figure 4: An example case with expert-annotated radiology report.
  • Figure 5: An example of a CT image before and after preprocessing.
  • ...and 6 more figures
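
The Figure 2 caption describes a two-stage flow: assemble a prompt from the image modality, clinical task, and report template, then score the generated report against the ground truth with NLG metrics. The sketch below illustrates that flow under stated assumptions. `PromptSpec`, `build_prompt`, the template section names, and the whitespace-token ROUGE-L variant are all hypothetical stand-ins and are not taken from the paper, whose actual prompt wording and metric implementations are not given in this section.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    """Hypothetical container for the three prompt ingredients named in Figure 2."""
    modality: str
    task: str
    template_sections: list[str]


def build_prompt(spec: PromptSpec) -> str:
    """Assemble one text prompt from modality, task, and report template."""
    sections = "\n".join(f"- {name}:" for name in spec.template_sections)
    return (
        f"Image modality: {spec.modality}\n"
        f"Clinical task: {spec.task}\n"
        f"Fill in the following report template:\n{sections}"
    )


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (a simplified variant)."""
    gen, ref = generated.lower().split(), reference.lower().split()
    if not gen or not ref:
        return 0.0
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Section names here are illustrative only, not the benchmark's template.
    spec = PromptSpec("PET/CT", "radiology report generation", ["Brain", "Chest"])
    print(build_prompt(spec))
    print(rouge_l_f1("increased uptake in the liver", "no increased uptake in liver"))
```

A surface-overlap score like this can rate a report highly even when it reverses a finding ("increased uptake" versus "no increased uptake" differ by one token), which is consistent with the benchmark pairing NLG metrics with a clinical efficacy metric and radiologist scoring.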