Comprehensive Evaluation of Quantitative Measurements from Automated Deep Segmentations of PSMA PET/CT Images

Obed Korshie Dzikunu, Amirhossein Toosi, Shadab Ahamed, Sara Harsini, Francois Benard, Xiaoxiao Li, Arman Rahmim

TL;DR

This work addresses the need to quantify PSMA PET/CT lesions using clinically meaningful metrics beyond the Dice Similarity Coefficient (DSC). It systematically compares three 3D-CNN architectures and four loss functions, introducing a novel L1-weighted Dice Focal Loss (L1DFL) and evaluating six quantitative metrics: SUVmean, SUVmax, TMTV, TLA, Dmax, and lesion count. Results show that Attention U-Net combined with L1DFL yields the strongest ground-truth concordance for SUVmax and TLA, with equivalence testing indicating high clinical agreement for SUV metrics, lesion count, and TLA, though tumor volume (TMTV) and lesion spread (Dmax) remain more variable. The findings suggest that L1DFL improves the clinical reliability of automated quantification across architectures, offering a practical path toward robust, clinically actionable PSMA PET/CT analysis; code is publicly available for reproducibility. The work advances quantitative imaging by linking segmentation quality to clinically relevant metrics and highlighting the remaining challenges in the more variable tumor-volume and lesion-spread metrics.

Abstract

This study performs a comprehensive evaluation of quantitative measurements extracted from automated deep-learning-based segmentation methods, beyond traditional Dice Similarity Coefficient assessments, focusing on six quantitative metrics: SUVmax, SUVmean, total lesion activity (TLA), tumor volume (TMTV), lesion count, and lesion spread. We analyzed 380 prostate-specific membrane antigen (PSMA) targeted [18F]DCFPyL PET/CT scans of patients with biochemical recurrence of prostate cancer, training three deep neural networks (U-Net, Attention U-Net, and SegResNet) with four loss functions: Dice Loss, Dice Cross Entropy, Dice Focal Loss, and our proposed L1-weighted Dice Focal Loss (L1DFL). Evaluations indicated that Attention U-Net paired with L1DFL achieved the strongest correlation with the ground truth (concordance correlation = 0.90-0.99 for SUVmax and TLA), whereas models employing the Dice Loss and the other two compound losses, particularly with SegResNet, underperformed. Equivalence testing (TOST, alpha = 0.05, Delta = 20%) confirmed high performance for SUV metrics, lesion count, and TLA, with L1DFL yielding the best performance. By contrast, tumor volume and lesion spread exhibited greater variability. Bland-Altman, Coverage Probability, and Total Deviation Index analyses further highlighted that our proposed L1DFL minimizes variability in quantification of the ground truth clinical measures. The code is publicly available at: https://github.com/ObedDzik/pca_segment.git.
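
The concordance correlation quoted above is Lin's Concordance Correlation Coefficient. Unlike Pearson's r, it penalizes systematic offsets between predicted and ground-truth values as well as scatter. The following is a minimal sketch in plain Python, assuming population (1/n) moments. The function name `lins_ccc` and the sample values are illustrative and are not taken from the authors' repository.

```python
from statistics import fmean


def lins_ccc(pred, truth):
    """Lin's concordance correlation coefficient for two paired sequences.

    Illustrative helper (not the authors' code); uses population (1/n) moments.
    """
    if len(pred) != len(truth) or len(pred) < 2:
        raise ValueError("need two paired sequences with at least 2 values")
    n = len(pred)
    mean_p, mean_t = fmean(pred), fmean(truth)
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    # The squared mean difference in the denominator lowers the score
    # whenever predictions are biased, even if they are perfectly correlated.
    return 2 * cov / denom


# Hypothetical per-scan SUVmax values, for illustration only.
truth = [1.0, 2.0, 3.0]
print(lins_ccc(truth, truth))            # perfect agreement
print(lins_ccc([2.0, 3.0, 4.0], truth))  # constant +1 bias
```

In the second call the predictions are perfectly linearly related to the ground truth, so Pearson's r would be 1, but the constant offset pulls the concordance down to 4/7 (about 0.57).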

Paper Structure

This paper contains 48 sections, 19 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An illustration of the weighting strategy of the L1-weighted Dice Focal Loss function. (a) Ground truth, where 0 represents background and 1 represents foreground. (b) Predicted probability map. (c) L1 norms between the predicted probabilities and the ground truth labels. Values closer to 0 indicate accurate model performance, and values closer to 1 indicate difficulty in pixel classification. (d) Histograms of pixel count for L1 norm values, with each bin having a width of 0.1. (e) Computed norm density based on the pixel count and bin width. (f) Calculated weight based on the total number of elements and the norm density. Higher weights are assigned to regions of higher norm values and lower class frequency (mainly the foreground region). The background regions and areas with lower norm values are assigned lower weights.
  • Figure 2: Illustration for defining a true positive detection based on an overlap with the voxel containing the maximum standardized uptake value (SUV$_{\text{max}}$) in the ground truth lesion. For a false negative (FN), either the overlap does not include the SUV$_{\text{max}}$ voxel, or, for a given $G_l$, there is no corresponding matched $P_l$. Similarly, a false positive (FP) is a prediction, $P_l$, for which there is no $G_l$. $G$ is the set of ground truth lesions and $P$ is the set of predicted lesions.
  • Figure 3: Radar plots showing the correlation between the predicted and ground truth metrics for the four loss functions, Dice Loss (DL), Dice Cross-Entropy (DCE), Dice Focal Loss (DFL), and L1-weighted Dice Focal Loss (L1DFL) assessed by Lin's Concordance Correlation Coefficient. The different radar plots illustrate the performance of the three architectures - U-Net, Attention U-Net, and SegResNet. The radial lines represent the correlation coefficient values. SUV$_{\text{mean}}$: mean standardized uptake value, SUV$_{\text{max}}$: maximum standardized uptake value, TMTV: total molecular tumor volume, TLA: total lesion activity, Dmax: lesion dissemination, L: lesion count.
  • Figure 4: Two one-sided forest plots illustrating the results of equivalence testing for predicted metrics versus ground truth across different loss functions. Each row represents a specific metric, and the columns correspond to the different networks. The black vertical dashed lines represent the region of clinical equivalence ($\pm 20\%$ of the mean of each ground truth metric). The x-axis represents the mean difference between the predicted and ground truth metrics. Green bars represent equivalence between predictions of a given loss function and ground truth values at a significance level of $\alpha = 0.05$, while red bars indicate non-equivalence.
  • Figure 5: Modified Bland-Altman plots illustrating the variations ($\Delta$) between predicted and ground truth values (y-axis) against the ground truth metrics (x-axis) for different loss functions, represented by distinct colors. Each row corresponds to a specific clinical metric, and the columns represent predictions from different network architectures (U-Net, SegResNet, and Attention U-Net). The dashed black horizontal lines indicate the limits of agreement ($\pm 20\%$ of the mean ground truth values), while the dotted line at $\Delta = 0$ represents perfect agreement. Each data point reflects a single instance, highlighting the distribution of variations across ground truth values.
  • ...and 2 more figures
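
The weighting steps described in the Figure 1 caption, panels (c) through (f), can be sketched with NumPy as follows. This is one reading of the caption, not the authors' implementation. It assumes that the norm density is the pixel count of a bin divided by the bin width, and that the weight is the total number of elements divided by that density. The function name `l1_density_weights` and the toy arrays are hypothetical.

```python
import numpy as np


def l1_density_weights(prob, target, bin_width=0.1):
    """Per-pixel weights from the density of L1 errors (illustrative sketch).

    Assumed formulas (not confirmed by the source):
      density(bin) = pixel count in bin / bin_width
      weight(pixel) = total number of pixels / density(bin of that pixel)
    """
    prob = np.asarray(prob, dtype=float)
    target = np.asarray(target, dtype=float)

    # (c) L1 norm between predicted probability and label, in [0, 1].
    l1 = np.abs(prob - target)

    # (d) Histogram of L1 values with fixed-width bins; an error of exactly
    # 1.0 is clipped into the last bin.
    n_bins = int(round(1.0 / bin_width))
    bin_idx = np.minimum((l1 / bin_width).astype(int), n_bins - 1)
    counts = np.bincount(bin_idx.ravel(), minlength=n_bins)

    # (e) Norm density per bin.
    density = counts / bin_width

    # (f) Weight per pixel; every occupied bin has count >= 1, so the
    # density looked up for any pixel is never zero.
    weights = l1.size / density[bin_idx]
    return l1, weights


# Toy 2x4 example: mostly well-classified background, one hard foreground pixel.
target = np.array([[0, 0, 0, 0], [0, 0, 1, 1]])
prob = np.array([[0.05, 0.05, 0.05, 0.05], [0.05, 0.05, 0.45, 0.95]])
l1, weights = l1_density_weights(prob, target)
print(weights)
```

In this toy example, seven pixels share the lowest error bin and receive a small weight, while the single poorly predicted foreground pixel sits alone in its bin and receives a much larger one. That matches the caption's description that rare, hard pixels are weighted more heavily than abundant, easy ones.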