STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics

Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li

TL;DR

STimage-1K4M is a large, open-source dataset that links histopathology sub-tiles with full spatial transcriptomics profiles across multiple species and tissues, enabling fine-grained multi-modal learning. The authors curate 1,149 slides and over 4.29 million spot-tiles, each paired with expression measurements for 15,000-30,000 genes, drawn from Spatial Transcriptomics, Visium, and VisiumHD, and provide pathologist annotations. They demonstrate a contrastive learning approach that fuses image and gene-expression signals to produce improved embeddings, with potential applications to gene-expression inference and spatial clustering. The dataset fills a critical gap in computational pathology by enabling joint image-gene analyses at sub-tile resolution, with broad applicability to cancer research and tissue biology.

Abstract

Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology and beyond.
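
The summary above mentions a contrastive learning approach over paired tile images and gene expressions but does not spell out the objective. As a rough illustration only, the sketch below implements a symmetric, CLIP-style contrastive (InfoNCE) loss in plain Python. The choice of this particular loss, the function names, the temperature value of 0.07, and the premise that both modalities have already been encoded into vectors of the same length are all assumptions made for the example, not details taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_emb, gene_emb, temperature=0.07):
    """Hypothetical CLIP-style loss: row i of image_emb and row i of
    gene_emb are assumed to come from the same spot-tile (a positive pair);
    every other combination in the batch is treated as a negative."""
    img = [l2_normalize(v) for v in image_emb]
    gen = [l2_normalize(v) for v in gene_emb]
    n = len(img)
    # Cosine similarity between every image and every gene embedding,
    # divided by the (assumed) temperature.
    logits = [
        [sum(a * b for a, b in zip(img[i], gen[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-gene direction: each image should pick out its own gene profile.
    loss_i2g = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Gene-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_g2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2g + loss_g2i)

# Toy batch of two spot-tiles with 2-dimensional embeddings.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # matched pairs give the lower loss
```

In this toy batch the loss is close to zero when each image embedding points in the same direction as its paired gene embedding, and large when the pairs are swapped, which is the behavior a contrastive objective of this kind is meant to reward.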

Paper Structure (13 sections, 4 figures)

This paper contains 13 sections, 4 figures.

Figures (4)

  • Figure 1: Overview of STimage-1K4M. (a) Curation overview. (b) ST technologies resolution. (c) Breakdown of technologies, species and tissue types in STimage-1K4M.
  • Figure 2: Popular tasks in ST data analysis.
  • Figure 3: Evaluation results. (a) Linear probing results, denoted by average macro F1 (error bars indicate standard deviations). (b) Silhouette, Calinski-Harabasz and Davies-Bouldin scores for image embeddings. (c) Histopathology image of brain sample 151675 colored by pathologist annotation. (d) t-SNE embeddings of sample 151675, colored by the same layer annotations as in (c).
  • Figure 4: Datasets with pathologist annotation. The points are colored by annotation in each dataset. The legend for the mouse brain data (bottom right) is omitted for visualization.
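
Figure 3(a) reports linear probing results as macro F1. For readers unfamiliar with the metric, the following minimal sketch computes it in plain Python: F1 is calculated separately for each class and the per-class values are averaged with equal weight, so rare annotation classes count as much as common ones. The function name and the layer labels in the toy example are hypothetical and are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example with made-up layer labels for four spots.
truth = ["layer_1", "layer_1", "layer_2", "layer_2"]
predicted = ["layer_1", "layer_2", "layer_2", "layer_2"]
score = macro_f1(truth, predicted)
print(round(score, 4))
```

Here the per-class F1 values are 2/3 for `layer_1` and 4/5 for `layer_2`, so the macro average is 11/15.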