Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings

Zachary Huemann, Samuel Church, Joshua D. Warner, Daniel Tran, Xin Tie, Alan B McMillan, Junjie Hu, Steve Y. Cho, Meghan Lubner, Tyler J. Bradshaw

TL;DR

The paper addresses the lack of large annotated PET/CT datasets for visual grounding by introducing a weak-labeling pipeline that uses SUVmax and axial slice mentions in radiology reports to link report sentences to their image locations. It trains a 3D vision-language model, ConTEXTual Net 3D, that fuses RadBERT-based text representations with a 3D nnU-Net via cross-attention to produce voxel-level lesion segmentations guided by the description text. The weak labels identified the correct lesion location in 98% of reviewed cases, and the model reached an F1 of 0.80, outperforming the LLMSeg and 2.5D baselines but not yet matching two nuclear medicine physicians. Performance was better on FDG and DCFPyL exams and dropped on DOTATE and fluciclovine exams and on low-uptake lesions, underscoring the need for larger, multi-center datasets and further refinement of temporal/textual disambiguation before broader clinical deployment.

Abstract

Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak labels linking PET/CT report descriptions to their image locations and used it to train a 3D vision-language visual grounding model. Our pipeline finds positive findings in PET/CT reports by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which integrates text embeddings from a large language model with a 3D nnU-Net via token-level cross-attention. The model's performance was compared against LLMSeg, a 2.5D version of ConTEXTual Net, and two nuclear medicine physicians. The weak-labeling pipeline accurately identified lesion locations in 98% of cases (246/251), with 7.5% requiring boundary adjustments. ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1=0.22) and the 2.5D model (F1=0.53), though it underperformed both physicians (F1=0.94 and 0.91). The model achieved better performance on FDG (F1=0.78) and DCFPyL (F1=0.75) exams, while performance dropped on DOTATE (F1=0.58) and Fluciclovine (F1=0.66). The model performed consistently across lesion sizes but showed reduced accuracy on lesions with low uptake. Our novel weak labeling pipeline accurately produced an annotated dataset of PET/CT image-text pairs, facilitating the development of 3D visual grounding models. ConTEXTual Net 3D significantly outperformed other models but fell short of the performance of nuclear medicine physicians. Our study suggests that even larger datasets may be needed to close this performance gap.
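The sentence-selection step described above (keeping only report sentences that mention both an SUVmax and an axial slice number) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regular expressions, the naive sentence splitter, and the names `SUV_RE`, `SLICE_RE`, `split_sentences`, and `extract_candidates` are all assumptions, and real report phrasing varies far more than these patterns cover. The later RadGraph and LLM filtering stages are not shown.

```python
import re

# Hypothetical patterns: assumed phrasing such as "SUVmax 8.4" or "SUV max of 8.4",
# and "slice 112" or "image 112". The paper's actual rules are not given here.
SUV_RE = re.compile(r"SUV\s*max\s*(?:of|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)
SLICE_RE = re.compile(r"(?:slice|image)\s*#?\s*(\d+)", re.IGNORECASE)


def split_sentences(report):
    """Naive splitter: break after '.', '!' or '?' when followed by
    whitespace and an uppercase letter (so decimals like 8.4 stay intact)."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", report)
    return [p.strip() for p in parts if p.strip()]


def extract_candidates(report):
    """Return sentences that mention both an SUVmax and a slice number."""
    candidates = []
    for sentence in split_sentences(report):
        suv = SUV_RE.search(sentence)
        slc = SLICE_RE.search(sentence)
        if suv and slc:
            candidates.append(
                {
                    "sentence": sentence,
                    "suvmax": float(suv.group(1)),
                    "slice": int(slc.group(1)),
                }
            )
    return candidates
```

For example, given a report containing "Hypermetabolic right hilar node with SUVmax 8.4 on slice 112." among sentences with no measurements, only that sentence would be returned, paired with the parsed SUVmax and slice number that are later used to locate the finding in the image.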

Paper Structure

This paper contains 18 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The data preprocessing pipeline. First, we split the PET reports into individual sentences and extract all sentences that contain both a slice number and an SUVmax. Of those, we use RadGraph to check which ones contain anatomical descriptor terms. We then use LLMs with in-context learning to filter out sentences that describe prior imaging or contain multiple findings. For the imaging annotations, the reported slice is searched for the specified SUVmax. If the SUVmax is found, we use an iterative thresholding method to create the label. This results in a training dataset of PET/CT images, descriptive sentences, and referring segmentations for developing a visual grounding model.
  • Figure 2: The 3D multimodal vision-language model used for visual grounding of PET findings. The 3D PET/CT is encoded via a 3D nnU-Net and the sentence is encoded via RadBERT. For cross-attention, the vision features serve as the query and the language embeddings serve as the key and the value. This produces voxel-wise attention maps, which are then applied to the voxel space, allowing for text-guided segmentation.
  • Figure 3: Example images and descriptions overlaid with the model outputs. True positives are shown in green, false positives are shown in red, and false negatives are shown in blue.
  • Figure 4: Performance of ConTEXTual Net 3D across varying training data sizes compared to physician benchmarks, with confidence intervals.
  • Figure A1: Multi-shot in-context prompt used to extract SUVmax and axial slice numbers for the present scan.
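
The cross-attention arrangement described in the Figure 2 caption (vision features as the query, language embeddings as the key and value) can be illustrated with a small NumPy sketch. This is an assumption-laden toy, not ConTEXTual Net 3D itself: the projection matrices `w_q`, `w_k`, `w_v`, the single attention head, the scaled dot-product form, and the final elementwise multiplication used to "apply" the attention map to the voxel features are all illustrative choices that the caption does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def text_guided_attention(vision_feats, token_embs, w_q, w_k, w_v):
    """Single-head cross-attention with voxels as queries and tokens as keys/values.

    vision_feats: (C, D, H, W) feature map from the image encoder
    token_embs:   (T, E) token embeddings from the text encoder
    w_q: (C, A), w_k: (E, A), w_v: (E, C) hypothetical projection matrices
    Returns the gated feature map (C, D, H, W) and attention weights (N, T),
    where N = D * H * W.
    """
    c, d, h, w = vision_feats.shape
    voxels = vision_feats.reshape(c, -1).T           # (N, C): one row per voxel
    q = voxels @ w_q                                 # (N, A) queries from vision
    k = token_embs @ w_k                             # (T, A) keys from language
    v = token_embs @ w_v                             # (T, C) values from language
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (N, T) voxel-token affinity
    weights = softmax(scores, axis=-1)               # each voxel attends over tokens
    attended = weights @ v                           # (N, C) per-voxel text summary
    attn_map = attended.T.reshape(c, d, h, w)        # back to the voxel grid
    # Assumption: the map is applied by elementwise multiplication.
    return vision_feats * attn_map, weights
```

Because every voxel issues its own query, each voxel receives its own weighting over the sentence tokens, which is what lets the text steer the segmentation at voxel resolution. In a trained model the projections would be learned and the operation would typically sit inside the decoder of the segmentation network.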