Deep Learning From Routine Histology Improves Risk Stratification for Biochemical Recurrence in Prostate Cancer

Clément Grisi; Khrystyna Faryna; Nefise Uysal; Vittorio Agosti; Enrico Munari; Solène-Florence Kammerer-Jacquet; Paulo Guilherme de Oliveira Salles; Yuri Tolkach; Reinhard Büttner; Sofiya Semko; Maksym Pikul; Axel Heidenreich; Jeroen van der Laak; Geert Litjens

Abstract

Accurate prediction of biochemical recurrence (BCR) after radical prostatectomy is critical for guiding adjuvant treatment and surveillance decisions in prostate cancer. However, existing clinicopathological risk models reduce complex morphology to relatively coarse descriptors, leaving substantial prognostic information embedded in routine histopathology underexplored. We present a deep learning-based biomarker that predicts continuous, patient-specific risk of BCR directly from H&E-stained whole-slide prostatectomy specimens. Trained end-to-end on time-to-event outcomes and evaluated across four independent international cohorts, our model demonstrates robust generalization across institutions and patient populations. When integrated with the CAPRA-S clinical risk score, the deep learning risk score consistently improved discrimination for BCR, increasing concordance indices from 0.725-0.772 to 0.749-0.788 across cohorts. To support clinical interpretability, outcome-grounded analyses revealed subtle histomorphological patterns associated with recurrence risk that are not captured by conventional clinicopathological risk scores. This multicohort study demonstrates that deep learning applied to routine prostate histopathology can deliver reproducible and clinically generalizable biomarkers that augment postoperative risk stratification, with potential to support personalized management of prostate cancer in real-world clinical settings.
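The concordance indices quoted in the abstract measure how often a model ranks two patients in the correct order of recurrence. As a minimal sketch, assuming Harrell's estimator (this section does not state which estimator the authors used), the quantity can be computed as follows. The function name and the argument names `times`, `events`, and `risks` are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times  -- follow-up time per patient
    events -- 1 if recurrence was observed, 0 if the patient was censored
    risks  -- predicted risk score (higher means earlier expected recurrence)

    Assumption: pairs with tied times are ignored, a common simplification.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair must be anchored on an observed event
        for j in range(n):
            if times[i] < times[j]:
                # Patient i recurred before patient j was last seen.
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported gains over CAPRA-S are differences on this scale.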

Paper Structure (50 sections, 16 equations, 15 figures, 9 tables)

Introduction
Results
Data overview and experimental setup
Model development.
External validation.
Data preprocessing.
Pretraining diversity drives generalization when models are trained on data from a single source
Conditional benefits of training data heterogeneity
Deep learning adds prognostic value beyond CAPRA-S
Model interpretability
Attention as a first approximation of model interpretability
Outcome-grounded interpretability via occlusion
Discussion
Methods
Introducing a small-scale, prostatectomy-specific encoder
...and 35 more sections

Figures (15)

Figure 1: Representative whole-slide histopathology from external test cohorts. Low-resolution thumbnails of H&E-stained prostatectomies from the four external test cohorts used in this study. The examples illustrate cohort-specific differences in tissue appearance and color profiles arising from variations in staining protocols, slide preparation, and slide acquisition. All slides were processed using the same preprocessing pipeline prior to model inference.
Figure 2: Integration of multiple histopathology slides into a unified patient-level prediction framework. (A) For patients with multiple slides, segmented tissue regions are cropped using the corresponding tissue masks and stitched into a single larger packed slide. This packing process minimizes empty background space while preserving the relative arrangement of tissue, enabling patient-level analysis in a unified coordinate frame. (B) The packed slide is tiled into non-overlapping square regions of $2048$ pixels at $0.50$ mpp. Regions with less than $1$% tissue coverage, estimated from the segmentation mask, are excluded. (C) Each retained region is then unrolled into non-overlapping $256$ pixel tiles, which are processed by a frozen tile encoder to generate tile-level embeddings. For each region, the sequence of tile embeddings is aggregated into a region-level embedding by a first Transformer. The sequence of region-level embeddings for the entire case is then aggregated by a second Transformer into a single patient-level representation, which is projected onto the target classes via a fully connected (FC) layer. The two Transformers and the FC layer are jointly trained for biochemical recurrence risk prediction. Together, these steps enable efficient processing of large whole-slide images, integrating spatially distributed information from multiple slides into a unified patient-level prediction.
Figure 3: Pairwise statistical comparisons of encoder performance across test cohorts. Heatmaps depict differences in concordance index between models trained on the RUMC-only splits and evaluated on four independent test cohorts (RUMC, PLCO, IMP, UHC). Each cell indicates the difference in performance between the encoder in the row and the encoder in the column. Positive values (green) reflect higher performance for the encoder in the row relative to the encoder in the column, while negative values (red) indicate the opposite. Significant differences are highlighted in bold.
Figure 4: Effect of training data enrichment on model performance across test cohorts. The heatmap depicts the change in concordance index ($\Delta$) for each model when trained on the combined RUMC+TCGA dataset compared to training on RUMC-only. Rows correspond to encoders and columns to test cohorts. Positive values (green) indicate improved performance with training data enrichment, while negative values (red) indicate decreased performance. Significant differences are highlighted in bold.
Figure 5: Pairwise statistical comparisons of encoder performance across test cohorts. Heatmaps depict differences in concordance index between models trained on the RUMC+TCGA splits and evaluated on four independent test cohorts (RUMC, PLCO, IMP, UHC). Each cell indicates the difference in performance between the encoder in the row and the encoder in the column. Positive values (green) reflect higher performance for the encoder in the row relative to the encoder in the column, while negative values (red) indicate the opposite. Significant differences are highlighted in bold.
...and 10 more figures
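The region and tile geometry described in the Figure 2 caption (panels B and C) can be illustrated with a short sketch. This is not the authors' pipeline. It assumes a hypothetical boolean tissue mask at the same resolution as the packed slide, and it drops partial regions at the slide border, which the caption does not specify. The function names `select_regions` and `tile_offsets` are illustrative only.

```python
import numpy as np

REGION = 2048       # region side in pixels at 0.50 mpp (from the caption)
TILE = 256          # tile side in pixels (from the caption)
MIN_TISSUE = 0.01   # regions with less than 1% tissue are excluded

def select_regions(mask, region=REGION, min_tissue=MIN_TISSUE):
    """Return (x, y) top-left corners of non-overlapping regions to keep.

    `mask` is a 2D boolean array, True on tissue. Assumption: partial
    regions at the right and bottom borders are skipped.
    """
    height, width = mask.shape
    kept = []
    for y in range(0, height - region + 1, region):
        for x in range(0, width - region + 1, region):
            coverage = mask[y:y + region, x:x + region].mean()
            if coverage >= min_tissue:
                kept.append((x, y))
    return kept

def tile_offsets(region=REGION, tile=TILE):
    """Return (x, y) offsets of non-overlapping tiles inside one region,
    in row-major order."""
    per_side = region // tile
    return [(col * tile, row * tile)
            for row in range(per_side)
            for col in range(per_side)]
```

With these sizes each retained region unrolls into 8 x 8 = 64 tiles, so the first Transformer in the caption would see a sequence of 64 tile embeddings per region.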
