Anatomy-Aware Lymphoma Lesion Detection in Whole-Body PET/CT

Simone Bendazzoli, Antonios Tzortzakakis, Andreas Abrahamsson, Björn Engelbrekt Wahlin, Örjan Smedby, Maria Holstensson, Rodrigo Moreno

TL;DR

This study investigates whether incorporating anatomical priors improves lymphoma lesion detection in whole-body PET/CT. The authors add TotalSegmentator segmentation masks covering 104 organs as inputs to the CNN-based nnDetection and to a Swin Transformer-based RetinaUNeTR, use self-supervised pretraining for the transformer, and compare performance on the AutoPET and Karolinska (KUH) datasets. Anatomical priors yield substantial gains for the CNN-based nnDetection, while the Swin Transformer-based approach gains little from the priors and underperforms the CNN baseline on this task. The findings highlight the value of explicit anatomical context for CNN detectors and suggest that transformer-specific enhancements are still needed to reach parity in medical object detection.

Abstract

Early cancer detection is crucial for improving patient outcomes, and 18F-FDG PET/CT imaging plays a vital role by combining metabolic and anatomical information. Accurate lesion detection remains challenging because multiple lesions of varying sizes must be identified. In this study, we investigate the effect of adding anatomical prior information to deep learning-based lesion detection models. In particular, we add organ segmentation masks from the TotalSegmentator tool as auxiliary inputs to provide anatomical context to two detectors: nnDetection, the state of the art for lesion detection, and a Swin Transformer-based model. The latter is trained in two stages that combine self-supervised pre-training and supervised fine-tuning. The methods are evaluated on the AutoPET and Karolinska lymphoma datasets. The results indicate that the inclusion of anatomical priors substantially improves detection performance within the nnDetection framework, while it has almost no impact on the performance of the vision transformer. Moreover, we observe that the Swin Transformer does not offer clear advantages over the conventional convolutional neural network (CNN) encoders used in nnDetection. These findings highlight the critical role of anatomical context in cancer lesion detection, especially in CNN-based models.
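
As an illustration of what "organ segmentation masks as auxiliary inputs" can mean in practice, the sketch below stacks a CT volume, a PET volume, and an organ label map into one multi-channel array. The abstract does not say how the masks are encoded, so the single scaled label channel, the function name `stack_with_anatomy`, and the constant `NUM_ORGAN_CLASSES` are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

# Assumption: 104 organ labels (as stated in the TL;DR), with 0 as background.
NUM_ORGAN_CLASSES = 104


def stack_with_anatomy(ct, pet, organ_labels, num_classes=NUM_ORGAN_CLASSES):
    """Stack CT, PET and an organ label map into a (3, D, H, W) float array.

    The label map is scaled to [0, 1] so it lives on a range comparable to
    normalized image intensities. This encoding is an illustrative choice.
    """
    if not (ct.shape == pet.shape == organ_labels.shape):
        raise ValueError("CT, PET and organ label volumes must share one shape")
    anatomy = organ_labels.astype(np.float32) / float(num_classes)
    return np.stack(
        [ct.astype(np.float32), pet.astype(np.float32), anatomy], axis=0
    )


# Usage with small synthetic volumes (random data, not real scans).
rng = np.random.default_rng(0)
ct = rng.normal(size=(8, 16, 16))
pet = rng.random(size=(8, 16, 16))
organs = rng.integers(0, NUM_ORGAN_CLASSES + 1, size=(8, 16, 16))
volume = stack_with_anatomy(ct, pet, organs)
```

A one-hot encoding with one channel per organ would be an alternative design; it keeps organs unordered at the cost of a much larger input tensor.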

Paper Structure

This paper contains 13 sections, 2 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: PET/CT scan from the AutoPET dataset, with manual annotations of lymphoma lesions highlighted in green. From left to right: axial, coronal, and sagittal views.
  • Figure 2: The proposed Swin Autoencoder network for the self-supervised task. A Swin Transformer is used as the feature extractor for the input PET/CT volumes, followed by a sequence of two transposed convolution layers to upsample the extracted features back to the original image resolution.
  • Figure 3: Retina U-Net architecture. The network consists of a 5-stage convolutional encoder (in yellow), followed by a convolutional Feature Pyramid Network that serves as the decoder (in red). A segmentation layer attached to the highest-resolution decoder level supports the auxiliary segmentation task. Two parallel detection heads, one for box classification and one for box regression, are connected to the four lower decoder levels and combine their different spatial resolutions to enable multi-scale object detection.
  • Figure 4: Proposed Swin RetinaUNeTR architecture. While maintaining structural similarity to Retina U-Net in the decoder and detection head components, the convolutional encoder is replaced by a Swin Transformer, serving as a vision transformer-based feature extractor.
  • Figure 5: Overview of the experimental workflow. The process begins with a self-supervised pretraining phase, where random PET/CT patches are corrupted and reconstructed using a Swin Transformer-based autoencoder. This is followed by an object detection training stage, where the pretrained Swin Transformer is integrated into a Feature Pyramid Network (FPN) and detection heads for multi-scale object detection.
  • ...and 7 more figures
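
The corrupt-and-reconstruct pretraining described in the Figure 5 caption can be sketched as follows. The captions do not state which corruption is used, so this example assumes that random cubic blocks are zeroed and that reconstruction is scored with mean squared error; `corrupt_patch`, `reconstruction_loss`, and their parameters are hypothetical names for illustration only.

```python
import numpy as np


def corrupt_patch(patch, num_blocks=4, block_size=8, rng=None):
    """Return a copy of a (C, D, H, W) patch with random cubic blocks zeroed.

    Assumption: block masking stands in for whatever corruption the
    authors apply; the original patch is left unmodified.
    """
    rng = np.random.default_rng() if rng is None else rng
    corrupted = patch.copy()
    spatial = patch.shape[1:]
    for _ in range(num_blocks):
        # Pick a random start per spatial axis so the block fits when possible.
        starts = [
            int(rng.integers(0, max(dim - block_size, 0) + 1)) for dim in spatial
        ]
        region = tuple(slice(s, s + block_size) for s in starts)
        corrupted[(slice(None),) + region] = 0.0
    return corrupted


def reconstruction_loss(reconstruction, target):
    """Mean squared error between a reconstruction and the clean patch."""
    return float(np.mean((reconstruction - target) ** 2))


# Usage: the autoencoder would receive `noisy` and be trained to output `clean`.
clean = np.ones((2, 16, 16, 16), dtype=np.float32)  # 2 channels: PET and CT
noisy = corrupt_patch(clean, rng=np.random.default_rng(0))
baseline = reconstruction_loss(noisy, clean)  # loss if the input is returned as-is
```

In an actual training loop the network's output would replace `noisy` in the loss call, and the pretrained encoder weights would then initialize the detection model.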