
Exploring the Interplay Between Colorectal Cancer Subtypes Genomic Variants and Cellular Morphology: A Deep-Learning Approach

Hadar Hezi, Daniel Shats, Daniel Gurevich, Yosef E. Maruvka, Moti Freiman

TL;DR

The improved accuracy of CNN models for CRC subtype classification that account for potential correlation between genomic variations within CRC subtypes and their corresponding cellular morphology as expressed by H&E imaging phenotypes may elucidate the biological cues impacting cancer histopathological imaging phenotypes.

Abstract

Molecular subtypes of colorectal cancer (CRC) significantly influence treatment decisions. While convolutional neural networks (CNNs) have recently been introduced for automated CRC subtype identification using H&E-stained histopathological images, the correlation between CRC subtype genomic variants and their corresponding cellular morphology, as expressed by their imaging phenotypes, is yet to be fully explored. The goal of this study was to determine such correlations by incorporating genomic variants in CNN models for CRC subtype classification from H&E images. We utilized the publicly available TCGA-CRC-DX dataset, which comprises whole slide images from 360 CRC-diagnosed patients (260 for training and 100 for testing). This dataset also provides information on CRC subtype classifications and genomic variations. We trained CNN models for CRC subtype classification that account for potential correlation between genomic variations within CRC subtypes and their corresponding cellular morphology patterns. We assessed the interplay between CRC subtypes' genomic variations and cellular morphology patterns by evaluating the CRC subtype classification accuracy of the different models in a stratified 5-fold cross-validation experimental setup, using the area under the ROC curve (AUROC) and average precision (AP) as the performance metrics. Combining the CNN models that account for variations in CIMP and SNP further improved classification accuracy (AUROC: 0.847$\pm$0.01 vs. 0.787$\pm$0.03, p$=$0.01, AP: 0.68$\pm$0.02 vs. 0.64$\pm$0.05).
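To make the two reported metrics concrete, the following is a minimal sketch of how per-patient AUROC and AP could be computed for one fold and then summarized across folds. It is not the authors' evaluation code: the function names, the toy labels and scores, and the use of the sample standard deviation for the "$\pm$" term are all assumptions made for illustration.

```python
from statistics import mean, stdev


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive.

    Patients are ranked by descending score; tied scores keep input order
    (an assumption; other implementations group ties differently).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("AP needs at least one positive")
    return total / hits


def summarize(fold_values):
    """Return (mean, sample standard deviation) over cross-validation folds."""
    return mean(fold_values), stdev(fold_values)


# Hypothetical patient-level labels (1 = MSI) and scores for a single fold.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
fold_auroc = auroc(labels, scores)
fold_ap = average_precision(labels, scores)
```

In a 5-fold setup, `summarize` would be called once per metric on the five per-fold values to obtain a "mean $\pm$ spread" figure of the kind quoted above.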

Paper Structure (19 sections, 1 equation, 12 figures, 1 algorithm)


Figures (12)

  • Figure 1: Summary of how the TCGA COAD and READ datasets were used. The total cohort encompasses n=632 patients. Some patients were excluded for technical reasons, leaving n=430 patients. Of these, Kather et al. [Kather2019DeepCancer] pre-processed and published data for n=360 patients, splitting them into a training and a testing set. The training set was balanced at the patch (p) level. For our research, we used stratified cross-validation folds at the patient level. The partitioning into these folds was informed by the novel sub-labels based on SNP rates and CIMP classifications.
  • Figure 2: Model Architectures. (a) Baseline Model Architecture: Patches are input into the Inception-Net [DBLP:journals/corr/SzegedyVISW15] for feature extraction, with the last two layers acting as fully connected classifier layers. Outputs are propagated to a softmax layer to obtain probabilities. $N$ represents the number of patches for a patient, while $P_i$ denotes the MSI probability of patch $i$. The MSI score for each patient, $P_w$, is the average of that patient's patch-level MSI probabilities. (b) Biologically-Primed Model Architecture: As in the baseline model, the softmax layer outputs class probabilities at the patch level. However, the MSI probability here is calculated as the maximum of the $MSI_1$ and $MSI_2$ outputs. The calculation of $P_w$ remains the same as in the baseline model.
  • Figure 3: Experimental flow for our exploration of the interplay between CRC subtype genomic variants and cellular morphology. The TCGA-CRC dataset, pre-processed by Kather et al. [Kather2019DeepCancer] (N=360), is split into different sets for analysis. A baseline model is trained, and based on its results, a molecular feature analysis is performed. Based on this analysis, we define our data classes from the ranges and categories of SNP, CIMP, and CNV (the BP class definitions step). After the definition, we divide the classes into five stratified folds. Next, three models are trained to evaluate the interplay between genomic variations and cellular morphology: BP-CNN$_\text{CIMP}$, BP-CNN$_\text{SNP}$, and BP-CNN$_\text{CNV}$, which classify the data based on CIMP, SNP, and CNV features, respectively. Based on their results, BP-CNN$_\text{CIMP}$ and BP-CNN$_\text{SNP}$ are further combined into BP-CNN$_\text{Combined}$ to incorporate the entire set of genomic variations identified as influencing cellular morphology.
  • Figure 4: Our BP-CNN$_\text{Combined}$ model. Models A and B represent biologically-primed models informed by two distinct genomic variations. The network outputs from trained and fixed models A and B are concatenated, fed into a linear layer, and then propagated to a softmax layer to yield probabilities. $N$ represents the number of patches for each patient, and $P_i$ indicates the corresponding MSI probabilities for these patches. The MSI score for each patient, denoted as $P_w$, is derived by averaging its respective MSI probabilities.
  • Figure 5: Baseline model results for per-patient classification of the test set, validated over 5 folds. Average and 95% CI curves: (a) ROC curve, (b) PR curve.
  • ...and 7 more figures
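
The patch-to-patient scoring described in the captions for Figures 2 and 4 can be sketched in a few lines. This is an illustrative reading of the captions only, not the authors' implementation: the dictionary keys (including the `"MSS"` class), the toy probabilities, and the weight and bias values passed to the linear layer are hypothetical.

```python
import math
from statistics import fmean


def bp_patch_msi(class_probs, msi_keys=("MSI_1", "MSI_2")):
    """Biologically-primed patch score: the larger of the two MSI outputs."""
    return max(class_probs[k] for k in msi_keys)


def patient_score(patch_msi_probs):
    """Patient-level score P_w: the mean of the N patch-level probabilities P_i."""
    if not patch_msi_probs:
        raise ValueError("a patient needs at least one patch")
    return fmean(patch_msi_probs)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_head(out_a, out_b, weights, bias):
    """Concatenate the outputs of frozen models A and B, apply one linear
    layer (one weight row per output class), then a softmax."""
    x = list(out_a) + list(out_b)
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical softmax outputs for two patches of one patient.
patches = [
    {"MSS": 0.5, "MSI_1": 0.25, "MSI_2": 0.25},
    {"MSS": 0.25, "MSI_1": 0.5, "MSI_2": 0.25},
]
p_w = patient_score([bp_patch_msi(p) for p in patches])
```

For the baseline model, `patient_score` would be applied directly to the single MSI output of each patch; for the combined model, it would be applied to the MSI entry of `combined_head`'s output.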