PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel D. Church, Matthew E. Larson, Scott B. Perlman, Tomas A. Romero, Joshua D. Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw

TL;DR

PETAR addresses the challenge of automated PET/CT reporting by introducing PETARSeg-11K, the first large-scale lesion-level dataset linking 3D masks to free-text findings, and PETAR-4B, a 3D mask-aware vision-language framework that grounds findings in localized volumes via focal prompts. The approach jointly encodes PET, CT, and lesion masks within a shared 3D transformer, leveraging staged training and TotalSegmentator pretraining to strengthen anatomical grounding. Automated metrics and a rigorous human study with five nuclear medicine physicians demonstrate that PETAR-4B achieves superior linguistic, semantic, and clinical grounding performance compared with 2D and 3D baselines, with strong alignment to expert judgment (GREEN correlation). The work provides a foundational resource and architectural blueprint for accurate, region-specific PET reporting and points to a viable path toward end-to-end automated PET report generation.

Abstract

Generating automated reports for 3D positron emission tomography (PET) is an important and challenging task in medical imaging. PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we first introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. It comprises 11,356 lesion descriptions paired with 3D segmentations. Second, we propose PETAR-4B, a 3D vision-language model designed for mask-aware, spatially grounded PET/CT reporting. PETAR-4B jointly encodes PET, CT, and 3D lesion segmentation masks, using a 3D focal prompt to capture fine-grained details of lesions that typically occupy less than 0.1% of the volume. Evaluations using automated metrics show that PETAR-4B substantially outperforms all 2D and 3D baselines. A human study involving five physicians, the first of its kind for automated PET reporting, confirms the model's clinical utility and establishes correlations between automated metrics and expert judgment. This work provides a foundational dataset and a novel architecture, advancing 3D medical vision-language understanding in PET.
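To make the "less than 0.1% of the volume" figure concrete, the sketch below measures how much of a volume a lesion mask occupies and then crops a window around the mask so the lesion fills a much larger share of the input. This is only one plausible reading of a focal prompt, not the authors' implementation: the function names `mask_fraction` and `focal_crop`, the `margin` parameter, and the synthetic 64x64x64 volume are all assumptions made for illustration.

```python
import numpy as np


def mask_fraction(mask: np.ndarray) -> float:
    """Fraction of voxels in the volume that belong to the lesion mask."""
    return float(mask.sum()) / mask.size


def focal_crop(volume: np.ndarray, mask: np.ndarray, margin: int = 4):
    """Crop volume and mask to the mask's bounding box plus a margin.

    Hypothetical helper: the margin is an illustrative choice, not a value
    taken from the paper.
    """
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("mask is empty")
    # Clamp the padded bounding box to the volume extents.
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    window = tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))
    return volume[window], mask[window]


# Synthetic example: a 4x4x4 lesion inside a 64x64x64 volume.
pet = np.random.default_rng(0).random((64, 64, 64))
lesion = np.zeros(pet.shape, dtype=bool)
lesion[30:34, 30:34, 30:34] = True

crop, crop_mask = focal_crop(pet, lesion, margin=4)
print(f"whole volume: {mask_fraction(lesion):.5f}")
print(f"focal crop:   {mask_fraction(crop_mask):.5f}")
```

In this toy case the lesion covers 64 of 262,144 voxels before cropping and 64 of 1,728 after, which illustrates why a global encoder alone may struggle to resolve lesion detail.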

Paper Structure

This paper contains 32 sections, 11 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Overview of mask-guided PET/CT report generation. By incorporating lesion-level masks, PETAR produces anatomically fine-grained findings grounded in the 3D volume. In contrast, general 3D models perform global encoding without fine-grained anatomical correlation, hence they generate vague or clinically incomplete descriptions.
  • Figure 2: Overview of the PETARSeg-11K data pipeline. LLMs extract key lesion attributes (e.g., SUVmax and slice number) from radiology reports, which guide a region-growing algorithm to localize and refine PET lesions. The final findings are structured into standardized fields linking text to 3D lesion masks.
  • Figure 3: Overview of the proposed framework. The left panel illustrates the overall architecture, which integrates PET, CT, and lesion mask inputs through 3D convolutional image projectors and an M3D-CLIP backbone. The resulting multi-modal visual tokens are fused and spatially pooled before being passed to a Phi3-4B language model that generates clinically grounded text descriptions conditioned on visual features and textual prompts. The right panel details the Image Encoder design, where modality-specific projectors (for PET, CT, and mask inputs) map each input into a shared latent space. These embeddings are subsequently processed by a ViT encoder to produce modality-aligned visual tokens for downstream fusion.
  • Figure 4: A comparison between different models. PETAR consistently produces anatomically correct descriptions. Prior to fine-tuning, both MedGemma and M3D-RAD produce highly inaccurate results. After fine-tuning, these models still tend to make localization errors. Anatomical descriptors are underlined for ease of comparison (red=incorrect, green=correct). Note: quantitative measurements (lesion size, SUVmax) are hallucinated by the models but can be easily replaced with directly measured values using the input lesion masks.
  • Figure 5: PETAR-4B predictions for examples from the autoPET dataset.
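The Figure 2 and Figure 4 captions describe two mechanical steps that a short example may clarify: growing a lesion mask from report-derived attributes (SUVmax and slice number), and reading quantitative values directly from a mask instead of trusting generated text. The sketch below is a minimal illustration under stated assumptions, not the PETARSeg-11K pipeline itself. The function names, the 6-connected neighborhood, the seed choice (the voxel on the cited slice closest to the reported SUVmax), and the 50%-of-seed threshold are all hypothetical choices.

```python
from collections import deque

import numpy as np


def seed_from_report(pet: np.ndarray, slice_index: int, suv_max: float):
    """Pick the voxel on the cited axial slice closest to the reported SUVmax."""
    plane = pet[slice_index]
    y, x = np.unravel_index(np.argmin(np.abs(plane - suv_max)), plane.shape)
    return (slice_index, int(y), int(x))


def grow_lesion(pet: np.ndarray, seed, threshold_fraction: float = 0.5):
    """6-connected region growing; keeps voxels >= threshold_fraction * seed value."""
    cutoff = threshold_fraction * pet[seed]
    mask = np.zeros(pet.shape, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        z, y, x = queue.popleft()
        for dz, dy, dx in offsets:
            n = (z + dz, y + dy, x + dx)
            inside = all(0 <= c < s for c, s in zip(n, pet.shape))
            if inside and not mask[n] and pet[n] >= cutoff:
                mask[n] = True
                queue.append(n)
    return mask


def measure_lesion(pet: np.ndarray, mask: np.ndarray, voxel_volume_ml: float):
    """Measure SUVmax and volume from the mask rather than from generated text."""
    return {
        "suv_max": float(pet[mask].max()),
        "volume_ml": float(mask.sum() * voxel_volume_ml),
    }


# Synthetic volume: background uptake of 1.0 with a 2x2x2 hot spot.
pet = np.ones((10, 10, 10))
pet[4:6, 4:6, 4:6] = 8.0
pet[5, 5, 5] = 10.0

seed = seed_from_report(pet, slice_index=5, suv_max=10.0)
mask = grow_lesion(pet, seed)
print(measure_lesion(pet, mask, voxel_volume_ml=0.5))
```

Measured values like these could overwrite the size and SUVmax fields of a generated finding, which is the replacement the Figure 4 note suggests.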