SurGen: 1020 H&E-stained Whole Slide Images With Survival and Genetic Markers

Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J. Harrison

TL;DR

SurGen introduces a large, publicly available dataset of 1,020 H&E-stained whole-slide images (WSIs) from 843 colorectal cancer cases, annotated with key genetic mutations (KRAS, NRAS, BRAF), mismatch repair (MMR) status, microsatellite instability (MSI), and, for a subset, survival data. The dataset comprises two cohorts, SR386 (primary CRC with survival data) and SR1482 (colorectal cancer with metastases), each scanned at 40× with uniform resolution (0.1112 μm/pixel) and stored in CZI format, enabling robust imaging-genomic analyses. A proof-of-concept transformer-based model using UNI patch embeddings achieves a test AUROC of 0.8273 for predicting MMR/MSI status from WSIs, significantly outperforming prior results on smaller cohorts and illustrating the dataset's usefulness for biomarker discovery, prognostic modelling, and foundation-model development in digital pathology. The authors provide extensive data and code resources to support reproducibility and external benchmarking, highlighting SurGen’s potential to accelerate personalised medicine and cross-cohort validation in colorectal oncology.

Abstract

Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine. We present SurGen, a dataset comprising 1,020 H&E-stained whole-slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. We illustrate SurGen's utility with a proof-of-concept model that predicts mismatch repair status directly from WSIs, achieving a test area under the receiver operating characteristic curve of 0.8273. These preliminary results underscore the dataset's potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer and beyond. SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset's capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online: https://doi.org/10.6019/S-BIAD1285.
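The headline metric above, the area under the receiver operating characteristic curve (AUROC), can be read as the probability that a randomly chosen positive slide receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `auroc`, the label convention (1 for MMR-deficient/MSI, 0 otherwise), and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via the pairwise (Mann-Whitney) definition.

    labels: 1 for the positive class, 0 for the negative class
            (hypothetical convention: 1 = MMR-deficient / MSI).
    scores: model outputs, where higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Synthetic slide-level labels and scores, not taken from SurGen.
    toy_labels = [1, 0, 1, 0, 0]
    toy_scores = [0.90, 0.40, 0.35, 0.20, 0.80]
    print(f"toy AUROC = {auroc(toy_labels, toy_scores):.4f}")
```

The quadratic loop keeps the definition visible; for cohort-sized evaluations a rank-based or library implementation would be the more practical choice.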

Paper Structure

This paper contains 43 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Heatmap of WSI dimensions (in pixels) across the SurGen dataset, illustrating the variability in image sizes due to differing tissue sample areas.
  • Figure 2: Hierarchical zoom visualisation of case SR1482_T412 with dimensions $242{,}506 \times 134{,}026$ pixels, corresponding to $26{,}974.20 \times 14{,}907.85 \, \mu m$. A) A low-resolution macro image of the whole slide, providing full anatomical context. B) Digitised whole slide image viewed at low magnification. C) Successive zoom-ins of the selected region from B), providing increasing granularity and enabling detailed examination of tissue structures while retaining the broader context. This hierarchical approach allows comprehensive visual exploration of tissue characteristics at varying scales. In practice, the pyramid levels are typically generated via Gaussian down-sampling to simulate various levels of magnification, which provides an immediate interface for retrieving images at varying resolutions.
  • Figure 3: Illustration of the age distributions (in years) of patients in each cohort, split by sex. The width of each violin represents the density of data points at different ages, highlighting the distributions within and across the cohorts.
  • Figure 4: Bar chart depicting the 5-year survival outcomes of the SR386 cohort. The chart shows the number of individuals who survived (n=264) versus those who did not survive (n=161) within the 5-year period following diagnosis. Cases whose survival status was not recorded (NULL values) are excluded.
  • Figure 5: Kaplan-Meier survival curve illustrating estimated survival probabilities over time. Censoring occurred for patients who survived beyond the 5-year study duration, as they were not followed further. The curve reflects the proportion of individuals surviving at each time point, with 95% confidence intervals representing the uncertainty in these estimates.
  • ...and 10 more figures
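
The Kaplan-Meier estimate described in the Figure 5 caption, with patients censored once they pass the 5-year study duration, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' analysis code: the function names `censor_at` and `kaplan_meier` are hypothetical, the follow-up times are synthetic, and the 95% confidence intervals shown in the figure are omitted.

```python
from typing import List, Sequence, Tuple


def censor_at(times: Sequence[float], events: Sequence[bool],
              horizon: float) -> Tuple[List[float], List[bool]]:
    """Administratively censor follow-up at `horizon` (e.g. 5 years).

    Anyone still under observation after the horizon is recorded as
    censored at the horizon, since they were not followed further.
    """
    new_times = [min(t, horizon) for t in times]
    new_events = [bool(e) and t <= horizon for t, e in zip(times, events)]
    return new_times, new_events


def kaplan_meier(times: Sequence[float],
                 events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is True if a death was observed at times[i] and False if
    the patient was censored at that time.
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Group everyone who leaves the risk set at the same time t.
        while i < len(order) and order[i][0] == t:
            deaths += 1 if order[i][1] else 0
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: S(t) = S(t-) * (1 - d_t / n_t).
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Synthetic follow-up times in years; not SurGen data.
    toy_times = [1.0, 2.0, 2.0, 3.5, 6.0, 7.0]
    toy_events = [True, True, False, True, True, False]
    t5, e5 = censor_at(toy_times, toy_events, horizon=5.0)
    for t, s in kaplan_meier(t5, e5):
        print(f"t = {t:.1f} years, S(t) = {s:.4f}")
```

In the toy data the death observed at 6 years falls after the horizon, so it is treated as a censored observation at 5 years and does not produce a step in the curve.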