PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel D. Church, Matthew E. Larson, Scott B. Perlman, Tomas A. Romero, Joshua D. Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
TL;DR
PETAR addresses the challenge of automated PET/CT reporting by introducing PETARSeg-11K, the first large-scale lesion-level dataset linking 3D masks to free-text findings, and PETAR-4B, a 3D mask-aware vision-language framework that grounds findings in localized volumes via focal prompts. The approach jointly encodes PET, CT, and lesion masks within a shared 3D transformer, leveraging staged training and TotalSegmentator pretraining to strengthen anatomical grounding. Automated metrics and a rigorous human study with five nuclear medicine physicians demonstrate that PETAR-4B achieves superior linguistic, semantic, and clinical grounding performance compared with 2D and 3D baselines, with strong alignment to expert judgment (GREEN correlation). The work provides a foundational resource and architectural blueprint for accurate, region-specific PET reporting and points to a viable path toward end-to-end automated PET report generation.
Abstract
Generating automated reports for 3D positron emission tomography (PET) is an important and challenging task in medical imaging. PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we make two contributions. First, we introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. It comprises 11,356 lesion descriptions paired with 3D segmentations. Second, we propose PETAR-4B, a 3D vision-language model designed for mask-aware, spatially grounded PET/CT reporting. PETAR-4B jointly encodes PET, CT, and 3D lesion segmentation masks, using a 3D focal prompt to capture fine-grained details of lesions that typically comprise less than 0.1% of the volume. Evaluations using automated metrics show PETAR-4B substantially outperforming all 2D and 3D baselines. A human study involving five physicians, the first of its kind for automated PET reporting, confirms the model's clinical utility and establishes correlations between automated metrics and expert judgment. This work provides a foundational dataset and a novel architecture, advancing 3D medical vision-language understanding in PET.
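To make the idea of a mask-guided focal input concrete, the following is a minimal NumPy sketch, not the PETAR-4B implementation. It assumes that a "focal" view can be approximated by cropping PET, CT, and the lesion mask to the lesion's bounding box plus a margin and stacking them as channels. The function names `focal_crop` and `lesion_fraction`, the `margin` parameter, and the toy volume sizes are all hypothetical and chosen only for illustration.

```python
import numpy as np


def focal_crop(pet, ct, mask, margin=2):
    """Crop PET, CT, and mask to the lesion bounding box plus a margin.

    Hypothetical helper: the actual focal prompt used by PETAR-4B is not
    specified in this section; this only illustrates the general idea.
    """
    if pet.shape != ct.shape or pet.shape != mask.shape:
        raise ValueError("pet, ct, and mask must share one 3D shape")
    coords = np.argwhere(mask > 0)  # (N, 3) indices of lesion voxels
    if coords.size == 0:
        raise ValueError("mask contains no lesion voxels")
    # Bounding box, expanded by `margin` voxels and clipped to the volume.
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    box = tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))
    # Stack the three inputs as channels: result has shape (3, D, H, W).
    return np.stack(
        [pet[box], ct[box], mask[box].astype(pet.dtype)], axis=0
    )


def lesion_fraction(mask):
    """Fraction of voxels in the volume that belong to the lesion."""
    return float((mask > 0).sum()) / mask.size


# Toy example with random data standing in for real PET/CT volumes.
rng = np.random.default_rng(0)
pet = rng.random((64, 64, 64), dtype=np.float32)
ct = rng.random((64, 64, 64), dtype=np.float32)
mask = np.zeros((64, 64, 64), dtype=np.uint8)
mask[30:34, 20:25, 40:46] = 1  # a 4 x 5 x 6 block of lesion voxels

crop = focal_crop(pet, ct, mask)
whole_volume_fraction = lesion_fraction(mask)
crop_fraction = lesion_fraction(crop[2])
```

In this toy case the lesion occupies well under 0.1% of the full volume but a much larger share of the cropped region, which is the motivation for giving a model a localized view in addition to the whole-body context.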
