Improving Performance in Colorectal Cancer Histology Decomposition using Deep and Ensemble Machine Learning

Fabi Prezja, Leevi Annala, Sampsa Kiiskinen, Suvi Lahtinen, Timo Ojala, Pekka Ruusuvuori, Teijo Kuopio

TL;DR

We introduce a hybrid model that combines deep transfer learning with ensemble machine learning by pairing the EfficientNetV2 architecture with a random forest classification head. The model improves upon previous approaches for this task, including a transformer and a neural architecture search baseline.

Abstract

In routine colorectal cancer management, histologic samples stained with hematoxylin and eosin are commonly used. Nonetheless, their potential for defining objective biomarkers for patient stratification and treatment selection is still being explored. The current gold standard relies on expensive and time-consuming genetic tests. However, recent research highlights the potential of convolutional neural networks (CNNs) to facilitate the extraction of clinically relevant biomarkers from these readily available images. These CNN-based biomarkers can predict patient outcomes comparably to the gold standard, with the added advantages of speed, automation, and minimal cost. Their predictive potential fundamentally relies on the ability of CNNs to accurately classify diverse tissue types in whole-slide microscope images. Consequently, improving the accuracy of tissue class decomposition is critical to amplifying the prognostic potential of imaging-based biomarkers. This study introduces a hybrid deep and ensemble machine learning model that surpassed all preceding solutions for this classification task. Our model achieved 96.74% accuracy on the external test set and 99.89% on the internal test set. Recognizing the potential of these models in advancing the task, we have made them publicly available for further research and development.
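The hybrid idea described above, a CNN backbone producing feature vectors that a random forest then classifies, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the forest is fitted on pooled backbone features, and it replaces the EfficientNetV2 backbone with a hypothetical stand-in that draws synthetic class-dependent vectors so the example runs without pretrained weights or image data. The names `stand_in_backbone_features` and `fit_forest_head`, the feature width, and the number of trees are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_CLASSES = 9     # nine tissue types, as in the dataset description
FEATURE_DIM = 64  # hypothetical width; a real pooled CNN output is wider


def stand_in_backbone_features(labels, centers, rng):
    """Hypothetical placeholder for pooled CNN features.

    A real pipeline would pass image tiles through a pretrained backbone
    and keep the pooled activations. Here each sample is its class
    center plus unit Gaussian noise.
    """
    noise = rng.normal(size=(labels.size, centers.shape[1]))
    return centers[labels] + noise


def fit_forest_head(features, labels, n_trees=100, seed=0):
    """Fit a random forest on fixed feature vectors (assumed settings)."""
    head = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    head.fit(features, labels)
    return head


rng = np.random.default_rng(0)
centers = rng.normal(scale=3.0, size=(N_CLASSES, FEATURE_DIM))

# Synthetic labels for a training split and a held-out split.
train_labels = rng.integers(0, N_CLASSES, size=900)
test_labels = rng.integers(0, N_CLASSES, size=300)

train_features = stand_in_backbone_features(train_labels, centers, rng)
test_features = stand_in_backbone_features(test_labels, centers, rng)

# The forest replaces the usual dense softmax classification layer.
head = fit_forest_head(train_features, train_labels)
test_pred = head.predict(test_features)
accuracy = float(np.mean(test_pred == test_labels))
print(f"accuracy on synthetic stand-in features: {accuracy:.3f}")
```

The accuracy printed here reflects only the synthetic features and says nothing about the reported results. In a real setting the stand-in function would be replaced by a forward pass through the trained backbone.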

Paper Structure (16 sections, 8 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Methodological Pipeline for Model Development. Blue arrows indicate data flow, red arrows represent model transfer, and numerical markers show the operational sequence.
  • Figure 2: Distribution of data from the source kather2019predicting. On the left, the inner ring displays the percentage of data allocated for training, along with class labels on the outer ring. The right side shows a similar breakdown for the external-testing data.
  • Figure 3: Tissue sample tiles and distribution across dataset partitions for nine tissue types. Each row represents a different dataset partition (Training, Validation, Test, and External Test), denoted as Train, Val, Test, and Ext. Test, respectively. Below each tissue type column are bar graphs displaying the count of images in each dataset partition.
  • Figure 4: Diagram illustrating the integrated architecture of the base EfficientNetV2 and our modifications after the pooling layer.
  • Figure 5: Benchmarking EfficientNetV2M + RF hybrid vs. AutoKeras NAS using one-vs-all ROC curves.
  • ...and 4 more figures