HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis

Guillaume Jaume, Paul Doucet, Andrew H. Song, Ming Y. Lu, Cristina Almagro-Pérez, Sophia J. Wagner, Anurag J. Vaidya, Richard J. Chen, Drew F. K. Williamson, Ahrong Kim, Faisal Mahmood

TL;DR

This work introduces HEST-1k, a collection of 1,229 spatial transcriptomic profiles, each linked to a WSI and extensive metadata, and introduces the HEST-Library, a Python package designed to perform a range of actions with HEST samples.

Abstract

Spatial transcriptomics (ST) enables interrogating the molecular composition of tissue with ever-increasing resolution and sensitivity. However, costs, rapidly evolving technology, and lack of standards have constrained computational methods in ST to narrow tasks and small cohorts. In addition, the underlying tissue morphology, as reflected by H&E-stained whole slide images (WSIs), encodes rich information often overlooked in ST studies. Here, we introduce HEST-1k, a collection of 1,229 spatial transcriptomic profiles, each linked to a WSI and extensive metadata. HEST-1k was assembled from 153 public and internal cohorts encompassing 26 organs, two species (Homo sapiens and Mus musculus), and 367 cancer samples from 25 cancer types. HEST-1k processing enabled the identification of 2.1 million expression-morphology pairs and over 76 million nuclei. To support its development, we additionally introduce the HEST-Library, a Python package designed to perform a range of actions with HEST samples. We test HEST-1k and HEST-Library on three use cases: (1) benchmarking foundation models for pathology (HEST-Benchmark), (2) biomarker exploration, and (3) multimodal representation learning. HEST-1k, HEST-Library, and HEST-Benchmark can be freely accessed at https://github.com/mahmoodlab/hest.

Paper Structure

This paper contains 33 sections, 6 figures, 16 tables.

Figures (6)

  • Figure 1: The HEST environment. a. Overview of HEST-1k, a dataset of $n$=1,229 paired spatial transcriptomics, H&E-stained whole-slide images and metadata. "Pathological" cases refer to non-tumor/non-cancer samples; "Tumor" refers to non-cancer samples. b. Overview of HEST-Library functionalities. c., d., e. Applications of HEST-1k include benchmarking foundation models for histology (c.), biomarker exploration (d.) and multimodal representation learning (e.).
  • Figure 2: Scaling laws in HEST-Benchmark. a. Model scaling law comparing the number of training parameters in the vision encoder (log-scale) and the average performance on the HEST-Benchmark. Pearson correlation between parameters and performance of R=0.81 (P-value < 0.01). b. Data scaling law comparing the number of image patches used for pretraining (log-scale) and the average performance on the HEST-Benchmark. Pearson correlation between number of patches and performance of R=0.48 (P-value=0.13).
  • Figure 3: HEST for biomarker exploration: Analysis of an invasive ductal carcinoma (IDC) sample imaged with Xenium. a. IDC Xenium sample with neoplastic nuclei overlaid in red ($n_c$=168,033 detected nuclei). Gray scale bar represents 2 mm. b. Heatmap of Xenium expression of gene GATA3. Blue and red values indicate above and below the mean (in white), respectively. c. Heatmap of neoplastic nuclear area. d. Four randomly selected regions with CellViT segmentation of the neoplastic nuclei. Black scale bar represents 30 $\mu$m. e., f. Correlation between nuclear area and GATA3, and minor axis length and MYBPC1.
  • Figure 4: Overview of HEST-Library functionalities. HEST was designed to transform legacy data scraped from multiple public repositories, such as NCBI, into unified HEST objects that can easily be integrated into computational pipelines.
  • Figure 5: Fiducial detection and automatic alignment in Visium. Corner fiducials on 6.5 mm $\times$ 6.5 mm and 11 mm $\times$ 11 mm Visium slides are automatically detected with a fine-tuned YOLOv8 model. The spot coordinates are then derived if at least 3 of the 4 corner fiducials are detected. This process enables automatic estimation of the pixel resolution.
  • ...and 1 more figures
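
The Figure 2 caption reports Pearson correlations between a log-scaled quantity (parameter count or number of pretraining patches) and average benchmark performance. As a minimal sketch of how such a coefficient can be computed, the snippet below takes the base-10 logarithm of the size variable before correlating it with the score. The function names `pearson_r` and `log_scale_correlation` and all numeric values are hypothetical and chosen for illustration only; they are not taken from the paper or from the HEST-Library, and the choice of base 10 is an assumption (Pearson correlation is unaffected by the logarithm base).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def log_scale_correlation(sizes, scores):
    """Correlate log10(size) with score, mirroring a log-scale x-axis."""
    return pearson_r([math.log10(s) for s in sizes], scores)


# Hypothetical values for illustration only; NOT results from the paper.
param_counts = [1e7, 1e8, 1e9]
avg_scores = [0.30, 0.35, 0.40]
r = log_scale_correlation(param_counts, avg_scores)
print(f"Pearson R on log-scaled sizes: {r:.3f}")
```

The sketch computes only the coefficient. The P-values quoted in the caption would additionally require a significance test, which is omitted here.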