STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li
TL;DR
STimage-1K4M is a large, open-source dataset that links histopathology sub-tiles with full spatial transcriptomics profiles across multiple species and tissues, enabling fine-grained multi-modal learning. The authors curate 1,149 slides and over 4.29 million spot-tiles, each paired with expression measurements for 15,000-30,000 genes, drawn from Spatial Transcriptomics, Visium, and VisiumHD, and provide pathologist annotations. They demonstrate a contrastive learning approach that fuses image and gene-expression signals to produce improved embeddings, with potential applications to gene-expression inference and spatial clustering. The dataset fills a critical gap in computational pathology by enabling joint image-gene analyses at sub-tile resolution, with broad applicability to cancer research and tissue biology.
Abstract
Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing both cancerous and healthy regions, but the accompanying text might only specify that the image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which capture gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with a 15,000-30,000-dimensional gene expression profile. With 4,293,195 pairs of sub-tile images and gene expression profiles, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology and beyond.
