
M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images

Hongyi Wang, Xiuju Du, Jing Liu, Shuyi Ouyang, Yen-Wei Chen, Lanfen Lin

TL;DR

This work tackles the challenge of predicting spatial transcriptomics (ST) maps from digital pathology whole-slide images (WSIs), because ST data are costly to acquire. It introduces M2OST, a many-to-one Transformer that jointly leverages multi-scale pathology patches through a decoupled encoder consisting of deformable patch embedding (DPE), an intra-level token mixing module (ITMM), a cross-level token mixing module (CTMM), and a cross-level channel mixing module (CCMM). Extensive experiments and ablation studies on three public ST datasets show that the many-to-one, multi-scale design achieves state-of-the-art ST prediction performance with fewer parameters and FLOPs, while effectively integrating inter-spot and multi-scale information. The approach offers a practical, end-to-end framework for cost-effective ST map generation, with strong generalization and scalability to different input configurations. Overall, M2OST advances multi-scale computational pathology by efficiently fusing hierarchical image information to predict spatial gene expressions with high fidelity.
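This summary does not spell out how DPE works. The following is a minimal NumPy sketch of the general idea behind deformable patch embeddings, assuming that DPE, like generic deformable patch embeddings, shifts and rescales each patch's sampling window, reads it with bilinear interpolation, and then applies a linear projection. The names `bilinear_sample`, `deformable_patch`, and `embed_patch`, and the hand-set `offset` and `scale` values, are hypothetical; in a trained model the offsets would be predicted by the network rather than passed in.

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Bilinearly sample img (H, W, C) at float coordinates ys, xs (same shape).

    Integer coordinates are pixel centers; coordinates are clamped to the border.
    """
    H, W = img.shape[:2]
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[..., None]  # fractional offsets, broadcast over channels
    wx = (xs - x0)[..., None]
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def deformable_patch(img, center, patch_size, offset, scale, k):
    """Hypothetical DPE-style sampling: read a k x k grid from a square window
    whose center is shifted by `offset` and whose height/width are multiplied by `scale`."""
    cy, cx = center[0] + offset[0], center[1] + offset[1]
    h, w = patch_size * scale[0], patch_size * scale[1]
    # k evenly spaced sample points spanning the deformed window
    gy = cy - h / 2 + (np.arange(k) + 0.5) * h / k
    gx = cx - w / 2 + (np.arange(k) + 0.5) * w / k
    ys, xs = np.meshgrid(gy, gx, indexing="ij")
    return bilinear_sample(img, ys, xs)  # (k, k, C)

def embed_patch(patch, w_proj):
    """Flatten the sampled patch and linearly project it to one token."""
    return patch.reshape(-1) @ w_proj

# Usage with illustrative sizes: one 16 x 16 RGB patch, shifted and stretched, embedded to 128 dims.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
patch = deformable_patch(img, center=(7.5, 7.5), patch_size=16,
                         offset=(2.0, -1.5), scale=(1.25, 1.0), k=16)
token = embed_patch(patch, rng.normal(size=(16 * 16 * 3, 128)))
```

With zero offset, unit scale, and `k` equal to `patch_size`, the sampler reads back exactly the pixels of an ordinary non-overlapping patch, so the standard ViT patch embedding is a special case; learned offsets would let each window drift and stretch toward the relevant tissue.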

Abstract

The advancement of Spatial Transcriptomics (ST) has facilitated spatially aware profiling of gene expressions based on histopathology images. Although ST data offer valuable insights into the tumor micro-environment, their acquisition remains expensive. Directly predicting ST expressions from digital pathology images is therefore desirable. Current methods usually adopt existing regression backbones together with patch sampling for this task, an approach that ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images and wastes the inter-spot visual information crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that accommodates the hierarchical structure of pathology images via a decoupled multi-scale feature extractor. Unlike traditional models trained with one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expressions in their common corresponding spot. Built upon this many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, enhancing regression performance. We tested M2OST on three public ST datasets, and the experimental results show that M2OST achieves state-of-the-art performance with fewer parameters and floating-point operations (FLOPs).
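The many-to-one pairing described above can be made concrete as crop geometry. The sketch below assumes that every pyramid level contributes a crop of the same pixel size centered on the spot, so coarser levels cover a wider field of view around the target spot (and therefore its neighbors), while all crops share a single gene-expression label. The 224-pixel crop size, the downsample factors, and the names `LevelCrop` and `many_to_one_crops` are illustrative assumptions, not values taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass(frozen=True)
class LevelCrop:
    level: int          # pyramid level index (0 = highest magnification)
    x0: int             # left edge, in this level's pixel coordinates
    y0: int             # top edge, in this level's pixel coordinates
    size: int           # crop side length, in this level's pixels
    fov_level0: float   # crop side length expressed in level-0 pixels

def many_to_one_crops(cx: float, cy: float, downsamples: list[float],
                      size: int = 224) -> list[LevelCrop]:
    """Hypothetical helper: one fixed-size crop per level, all centered on the
    spot at level-0 coordinates (cx, cy). All crops map to ONE spot label."""
    crops = []
    for level, ds in enumerate(downsamples):
        lx, ly = cx / ds, cy / ds  # spot center in this level's coordinates
        crops.append(LevelCrop(level,
                               int(round(lx - size / 2)),
                               int(round(ly - size / 2)),
                               size,
                               size * ds))
    return crops

# Spot centered at (4000, 3000) in level-0 pixels, three levels with downsamples 1, 2, 4.
crops = many_to_one_crops(4000, 3000, [1, 2, 4])
# x0 per level: 3888, 1888, 888; field of view in level-0 pixels: 224, 448, 896
```

Under this assumption a training sample is (crop at level 0, crop at level 1, crop at level 2) mapped to one expression vector, instead of one crop mapped to one vector; adding or dropping a level only changes the length of `downsamples`.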

Paper Structure (30 sections, 2 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: (a) WSIs are obtained by scanning tissue glass slides at different magnifications, resulting in a multi-scale pyramid data structure. (b) ST maps are generated by sampling spots on the tissue slides and then comprehensively profiling the gene expressions within each sampled spot.
  • Figure 2: A schematic view of the proposed M2OST. Three patch sequences from different WSI levels are fed into the model to jointly predict the gene expressions in the corresponding spot. In the figure, PE denotes the fully learnable positional embedding.
  • Figure 3: DPE used in M2OST. The circular area in $G$ indicates the target ST spot.
  • Figure 4: The network structure of ITMM. This module is applied to each level's sequence separately.
  • Figure 5: (a) The network structure of CTMM. (b) The network structure of CCMM.
  • ...and 2 more figures
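The data flow in Figures 2, 4, and 5 (per-level sequences with a learnable positional embedding, ITMM applied to each level separately, then cross-level token and channel mixing) can be traced on tensor shapes. The sketch below is a shape-level illustration only: it substitutes MLP-Mixer-style linear mixing for the actual modules, assumes every level yields the same number of tokens N so the levels stack into an (L, N, D) array, shares one ITMM weight across levels, and averages all tokens before a linear regression head. All function names, parameter names, and sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token over its channel axis (no learnable affine, for brevity)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def itmm(x, w_tok):
    """Intra-level token mixing: mix the N tokens of each level separately. x: (L, N, D)."""
    return x + np.einsum("mn,lnd->lmd", w_tok, layer_norm(x))

def ctmm(x, w_cross):
    """Cross-level token mixing: treat all L*N tokens as one sequence and mix them."""
    L, N, D = x.shape
    flat = layer_norm(x).reshape(L * N, D)
    return x + (w_cross @ flat).reshape(L, N, D)

def ccmm(x, w1, w2):
    """Cross-level channel mixing: at each token position, concatenate the
    channels of all levels (L*D) and pass them through a two-layer MLP."""
    L, N, D = x.shape
    per_pos = layer_norm(x).transpose(1, 0, 2).reshape(N, L * D)
    out = gelu(per_pos @ w1) @ w2  # (N, L*D)
    return x + out.reshape(N, L, D).transpose(1, 0, 2)

def init_params(rng, L, N, D, hidden, n_genes, s=0.02):
    return {
        "pos": rng.normal(scale=s, size=(L, N, D)),  # learnable positional embedding per level
        "w_tok": rng.normal(scale=s, size=(N, N)),
        "w_cross": rng.normal(scale=s, size=(L * N, L * N)),
        "w1": rng.normal(scale=s, size=(L * D, hidden)),
        "w2": rng.normal(scale=s, size=(hidden, L * D)),
        "w_head": rng.normal(scale=s, size=(D, n_genes)),
    }

def encode_and_regress(x, p):
    """Many-to-one forward pass: L level sequences in, one expression vector out."""
    x = x + p["pos"]
    x = itmm(x, p["w_tok"])
    x = ctmm(x, p["w_cross"])
    x = ccmm(x, p["w1"], p["w2"])
    return x.mean(axis=(0, 1)) @ p["w_head"]  # (n_genes,)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 16, 32))  # 3 levels, 16 tokens each, 32 channels (illustrative)
params = init_params(rng, L=3, N=16, D=32, hidden=64, n_genes=50)
pred = encode_and_regress(tokens, params)  # shape (50,)
```

In this sketch, ITMM never lets one level's tokens influence another level's output; information moves between levels only through CTMM (along the token axis) and CCMM (along the channel axis).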