STARC-9: A Large-scale Dataset for Multi-Class Tissue Classification for CRC Histopathology

Barathi Subramanian, Rathinaraja Jeyaraj, Mitchell Nevin Peterson, Terry Guo, Nigam Shah, Curtis Langlotz, Andrew Y. Ng, Jeanne Shen

TL;DR

STARC-9 tackles the scarcity of diverse, high-quality CRC histopathology datasets for robust tissue-type classification by introducing a large-scale 630,000-tile dataset across nine classes, constructed with the semi-automated DeepCluster++ framework. The pipeline combines a CRC-specific autoencoder, morphology-guided clustering, equal-frequency sampling, and expert verification to maximize intra-class diversity while minimizing manual labeling. Comprehensive benchmarking shows STARC-9-trained models consistently outperform those trained on prior datasets across classification and tumor segmentation, with strong generalization on independent validation sets and external cohorts. The framework is designed to be adaptable to other WSIs and cancer types, offering a scalable path toward more generalizable AI-assisted pathology, while highlighting the need for broader multi-institutional data and multi-modal extensions.

Abstract

Multi-class tissue-type classification of colorectal cancer (CRC) histopathologic images is a significant step in the development of downstream machine learning models for diagnosis and treatment planning. However, existing public CRC datasets often lack morphologic diversity, suffer from class imbalance, and contain low-quality image tiles, limiting model performance and generalizability. To address these issues, we introduce STARC-9 (STAnford coloRectal Cancer), a large-scale dataset for multi-class tissue classification. STARC-9 contains 630,000 hematoxylin and eosin-stained image tiles uniformly sampled across nine clinically relevant tissue classes (70,000 tiles per class) from 200 CRC patients at the Stanford University School of Medicine. The dataset was built using a novel framework, DeepCluster++, designed to ensure intra-class diversity and reduce manual curation. First, an encoder from a histopathology-specific autoencoder extracts feature vectors from tiles within each whole-slide image (WSI). Then, K-means clustering groups morphologically similar tiles, followed by equal-frequency binning to sample diverse morphologic patterns within each class. The selected tiles are subsequently verified by expert gastrointestinal pathologists to ensure accuracy. This semi-automated process significantly reduces manual effort while producing high-quality, diverse tiles. To evaluate STARC-9, we benchmarked convolutional neural networks, transformers, and pathology-specific foundation models on multi-class CRC tissue classification and segmentation tasks; models trained on STARC-9 generalized better than models trained on existing datasets. Although we demonstrate the utility of DeepCluster++ on CRC as a pilot use case, it is a flexible framework that can be used for constructing high-quality datasets from large WSI repositories across a wide range of cancer and non-cancer applications.
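The clustering and sampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' DeepCluster++ implementation: the function names (`kmeans`, `equal_frequency_sample`), the parameters (`n_bins`, `per_bin`), and the synthetic 2-D features standing in for encoder outputs are all hypothetical. In particular, the abstract does not say which quantity is binned; the sketch assumes tiles in each cluster are ranked by distance to the cluster centroid and split into equal-frequency bins, so that both typical and atypical tiles are sampled.

```python
import math
import random


def assign(features, centroids):
    """Label each feature vector with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(f, centroids[c]))
        for f in features
    ]


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    for _ in range(iters):
        labels = assign(features, centroids)
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(features, centroids)


def equal_frequency_sample(features, centroids, labels,
                           n_bins=4, per_bin=2, seed=0):
    """ASSUMPTION: bins are formed over distance to the cluster centroid.

    Within each cluster, tiles are ranked by that distance, the ranking is cut
    into n_bins bins of (nearly) equal size, and up to per_bin tiles are drawn
    from every bin. Returns the sorted indices of the selected tiles.
    """
    rng = random.Random(seed)
    selected = []
    for c in range(len(centroids)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        idx.sort(key=lambda i: math.dist(features[i], centroids[c]))
        # Equal-frequency (quantile) bin edges over the ranked tiles.
        edges = [round(b * len(idx) / n_bins) for b in range(n_bins + 1)]
        for lo, hi in zip(edges, edges[1:]):
            bin_idx = idx[lo:hi]
            selected += rng.sample(bin_idx, min(per_bin, len(bin_idx)))
    return sorted(selected)


# Synthetic stand-in for encoder features: three blobs of 40 "tiles" each.
demo_rng = random.Random(42)
features = [
    [demo_rng.gauss(cx, 1.0), demo_rng.gauss(cy, 1.0)]
    for cx, cy in [(0, 0), (8, 8), (0, 8)]
    for _ in range(40)
]
centroids, labels = kmeans(features, k=3)
chosen = equal_frequency_sample(features, centroids, labels)
print(len(chosen), "tiles selected for expert review")
```

In a real pipeline the features would come from the trained encoder, and a library implementation such as scikit-learn's `KMeans` would replace the hand-written loop. The selected indices correspond to the tiles that would then go to pathologists for verification.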

Paper Structure

This paper contains 27 sections, 27 figures, and 6 tables.

Figures (27)

  • Figure 1: Overview of STARC-9 large-scale dataset generation.
  • Figure 2: DeepCluster++ framework (Phases 1 and 2) followed by pathologist verification (Phase 3).
  • Figure 3: Feature map visualizations for the best models trained on HMU, NCT, and STARC-9.
  • Figure 4: Tumor segmentation within 2048x2048 regions from a WSI from the CURATED-TCGA-CRC-HE-VAL dataset using tile-level classifiers trained on HMU, NCT, and STARC-9.
  • Figure 5: STARC-9 patient demographic details.
  • ...and 22 more figures