Comprehensive framework for evaluation of deep neural networks in detection and quantification of lymphoma from PET/CT images: clinical insights, pitfalls, and observer agreement analyses

Shadab Ahamed; Yixi Xu; Sara Kurkowska; Claire Gowdy; Joo H. O; Ingrid Bloise; Don Wilson; Patrick Martineau; François Bénard; Fereshteh Yousefirizi; Rahul Dodhia; Juan M. Lavista; William B. Weeks; Carlos F. Uribe; Arman Rahmim

TL;DR

This work addresses the generalization and clinical relevance gaps in automated lymphoma segmentation from PET/CT by proposing a framework that integrates out-of-distribution testing, lesion-specific metrics, and observer variability analyses. It evaluates four state-of-the-art networks (ResUNet, SegResNet, DynUNet, SwinUNETR) on 611 cases from multi-institutional cohorts, comparing performance against human experts using segmentation metrics, per-lesion detection criteria, and six clinically meaningful lesion measures ($SUV_{mean}$, $SUV_{max}$, number of lesions $L$, $TMTV$, $TLG$, $D_{max}$). The study finds that SegResNet and ResUNet are generally superior in DSC and FPV, while SwinUNETR excels in FNV. It also demonstrates that clinically relevant metrics and detection criteria (especially the $SUV_{max}$-based Criterion 3) reveal nuances that DSC alone misses. Physician performance is competitive and sometimes superior in metabolically focused detection, underscoring the value of integrating human expertise with automated tools. The results highlight the importance of multi-physician ground truth, out-of-distribution validation, and metrics that align with clinical decision-making to improve translation of PET/CT lymphoma segmentation into practice.
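As a rough illustration of the six lesion measures listed above, the following sketch computes them from an SUV volume and a binary lesion mask. It is not taken from the paper's code: the function name `lesion_measures`, the 26-connected definition of a lesion, and the definition of $D_{max}$ as the largest distance between any two lesion voxels are assumptions made for this example. $TLG$ is computed as $SUV_{mean} \times TMTV$, its standard definition.

```python
import numpy as np
from scipy import ndimage
from scipy.spatial.distance import pdist


def lesion_measures(suv, mask, spacing_mm):
    """Patient-level lesion measures from an SUV volume and a binary mask.

    suv: 3D array of SUV values; mask: 3D array (nonzero = lesion);
    spacing_mm: voxel size along each array axis, in millimetres.
    """
    mask = np.asarray(mask).astype(bool)
    if not mask.any():
        return {"SUVmean": 0.0, "SUVmax": 0.0, "L": 0,
                "TMTV": 0.0, "TLG": 0.0, "Dmax": 0.0}

    voxel_ml = float(np.prod(spacing_mm)) / 1000.0  # mm^3 -> mL
    suv_mean = float(suv[mask].mean())
    suv_max = float(suv[mask].max())
    tmtv = float(mask.sum()) * voxel_ml              # total metabolic tumor volume
    tlg = suv_mean * tmtv                            # total lesion glycolysis

    # Assumption: one lesion corresponds to one 26-connected component.
    structure = ndimage.generate_binary_structure(3, 3)
    _, n_lesions = ndimage.label(mask, structure=structure)

    # Assumption: Dmax is the largest distance between any two lesion voxels.
    # pdist is quadratic in the number of voxels, so this suits small masks only.
    coords_mm = np.argwhere(mask) * np.asarray(spacing_mm, dtype=float)
    dmax_cm = float(pdist(coords_mm).max()) / 10.0 if len(coords_mm) > 1 else 0.0

    return {"SUVmean": suv_mean, "SUVmax": suv_max, "L": int(n_lesions),
            "TMTV": tmtv, "TLG": tlg, "Dmax": dmax_cm}
```

Running the same function on a predicted mask and on a ground-truth mask gives the paired values needed to study prediction errors for each measure.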

Abstract

This study addresses critical gaps in automated lymphoma segmentation from PET/CT images, focusing on issues often overlooked in existing literature. While deep learning has been applied for lymphoma lesion segmentation, few studies incorporate out-of-distribution testing, raising concerns about model generalizability across diverse imaging conditions and patient populations. We highlight the need to compare model performance with expert human annotators, including intra- and inter-observer variability, to understand task difficulty better. Most approaches focus on overall segmentation accuracy but overlook lesion-specific measures important for precise lesion detection and disease quantification. To address these gaps, we propose a clinically relevant framework for evaluating deep segmentation networks. Using this lesion measure-specific evaluation, we assess the performance of four deep networks (ResUNet, SegResNet, DynUNet, and SwinUNETR) across 611 cases from multi-institutional datasets, covering various lymphoma subtypes and lesion characteristics. Beyond standard metrics like the Dice similarity coefficient, we evaluate clinical lesion measures and their prediction errors. We also introduce detection criteria for lesion localization and propose a new detection Criterion 3 based on metabolic characteristics. We show that networks perform better on large, intense lesions with higher metabolic activity. Finally, we compare network performance to physicians via intra- and inter-observer variability analyses, demonstrating that network errors closely resemble those made by experts, i.e., the small and faint lesions remain challenging for both humans and networks. This study aims to improve automated lesion segmentation's clinical relevance, supporting better treatment decisions for lymphoma patients. The code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn.
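The segmentation metrics named in this summary (DSC, FPV, FNV) can be sketched as follows. The abstract does not spell out the definitions, so this example assumes the common connected-component convention: FPV is the volume of predicted components that do not overlap the ground truth at all, and FNV is the volume of ground-truth components that the prediction misses entirely. The helper names and the treatment of two empty masks as perfect agreement are likewise assumptions of this sketch.

```python
import numpy as np
from scipy import ndimage


def dice(gt, pred):
    """Dice similarity coefficient between two binary masks."""
    gt, pred = np.asarray(gt).astype(bool), np.asarray(pred).astype(bool)
    denom = gt.sum() + pred.sum()
    # Assumed convention: two empty masks agree perfectly.
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(gt, pred).sum() / denom


def _unmatched_volume(source, target, voxel_ml):
    """Volume (mL) of connected components of `source` that never touch `target`."""
    source, target = np.asarray(source).astype(bool), np.asarray(target).astype(bool)
    labels, n_components = ndimage.label(source)  # default face connectivity
    if n_components == 0:
        return 0.0
    touched = np.unique(labels[target])           # component ids overlapping target
    unmatched = source & np.isin(labels, touched[touched > 0], invert=True)
    return float(unmatched.sum()) * voxel_ml


def false_positive_volume(gt, pred, voxel_ml):
    """Predicted components with no overlap with the ground truth."""
    return _unmatched_volume(pred, gt, voxel_ml)


def false_negative_volume(gt, pred, voxel_ml):
    """Ground-truth components with no overlap with the prediction."""
    return _unmatched_volume(gt, pred, voxel_ml)
```

Under this convention a prediction that only partly covers a lesion lowers the DSC but contributes nothing to FPV or FNV, which is why the three metrics are reported together.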

Paper Structure (37 sections, 8 equations, 15 figures, 5 tables)

Introduction
Related work
Materials and methods
Dataset
Ground truth (GT) annotations
Networks, tools and code
Training methodology
Evaluation
Segmentation metrics
Detection metrics
Clinically-relevant lesion measures and intra- and inter-observer agreement analysis
Lesion measure threshold analysis
Results
Segmentation performance
Reproducibility of lesion measures
...and 22 more sections

Figures (15)

Figure 1: (a) Illustration of the two segmentation metrics, false positive volume (FPV) and false negative volume (FNV). (b) Illustration of how a true positive detection is defined via three criteria, as explained in Sec. \ref{subsubsec:detection_metrics}.
Figure 2: Distribution of ground truth (GT) lesion measures on the test sets from different cohorts, showing the diversity of the datasets. The subscript $g$ denotes GT, i.e., these measures were extracted from the GT segmentation masks annotated by physicians.
Figure 3: Modified Bland-Altman plots showing errors (predicted - GT) in the estimation of lesion measures as a function of GT lesion measure values for the four networks (ResUNet, SegResNet, DynUNet, and SwinUNETR) on the combined internal and external test set ($N_\text{cases} = 233$). The black dashed line represents the mean error over all networks, and the black dotted lines represent the mean $\pm 1.96$ SD. The $x$-axis is shown on a log scale.
Figure 4: SegResNet performance (DSC) distribution for different GT lesion measures on various test sets. For each test set, the DSC distributions are presented as boxplots in three categories: (i) lesion measure $\leq$ 20th percentile, (ii) 20th percentile $<$ lesion measure $\leq$ 75th percentile, and (iii) lesion measure $>$ 75th percentile. The mean and median values for each box are shown as white circles and black horizontal lines, respectively. The boxes below each plot show the values of the 20th and 75th percentiles of the lesion measure on each test set. Additional plots for ResUNet, DynUNet, and SwinUNETR are presented in Figs. \ref{fig:lesion_measures_segregated_dsc_metrics_unet}, \ref{fig:lesion_measures_segregated_dsc_metrics_dynunet}, and \ref{fig:lesion_measures_segregated_dsc_metrics_swinunetr}, respectively, in Appendix \ref{subsubsec:lesion_measure_segregated_dsc_for_other_networks}.
Figure 5: The effect of GT lesion measure values from subsets of test cases (internal and external combined with $N_\text{cases} = 233$) on network performance. For a lesion measure $b$, a threshold $t_b$ was chosen, the subset of internal test cases with $b \geq t_b$ was selected, and the median DSC was computed on this subset. The values of $t_b$ were chosen in the range $[\mathcal{B}_0, \mathcal{B}_{85}]$ in steps of $\Delta t_b$, where $\mathcal{B}_0$ and $\mathcal{B}_{85}$ represent the 0$^\text{th}$ and 85$^\text{th}$ percentiles of the set of all lesion measures on the internal test set, $\mathcal{B} = \{b_i\}_{i=1}^{N_\text{cases}}$. Panels (a), (b), (d), and (e) show that, in general, network performance increases on subsets with larger values of $\text{SUV}_\text{mean}$, $\text{SUV}_\text{max}$, TMTV, and TLG, respectively, up to certain values of $t_b$ (after which the performance plateaus), while for the number of lesions (c) and $\text{D}_\text{max}$ (f) this increase is not very prominent.
...and 10 more figures
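
The analyses described in the captions of Figures 3 and 5 can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the function names are hypothetical, the number of threshold steps `n_steps` stands in for the step size $\Delta t_b$, and the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np


def limits_of_agreement(pred, gt):
    """Mean error (predicted - GT) and the mean +/- 1.96 SD band, as in Figure 3."""
    err = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    mean = float(err.mean())
    sd = float(err.std(ddof=1))  # assumption: sample standard deviation
    return mean, mean - 1.96 * sd, mean + 1.96 * sd


def median_dsc_above_thresholds(measure, dsc, n_steps=20, upper_pct=85.0):
    """Median DSC on the subset of cases with measure >= t_b, as in Figure 5.

    Thresholds t_b are spaced evenly between the 0th and `upper_pct`-th
    percentiles of the per-case lesion measure.
    """
    measure = np.asarray(measure, dtype=float)
    dsc = np.asarray(dsc, dtype=float)
    lo, hi = np.percentile(measure, [0.0, upper_pct])
    thresholds = np.linspace(lo, hi, n_steps)
    # Every subset is non-empty because the largest threshold is a percentile
    # of `measure`, so at least the maximum-valued case always qualifies.
    medians = np.array([np.median(dsc[measure >= t]) for t in thresholds])
    return thresholds, medians
```

Plotting `medians` against `thresholds` for each lesion measure and each network would give curves of the kind Figure 5 describes, where a plateau indicates that excluding more small-valued cases no longer improves the median DSC.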
