M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images
Hongyi Wang, Xiuju Du, Jing Liu, Shuyi Ouyang, Yen-Wei Chen, Lanfen Lin
TL;DR
This work tackles the challenge of predicting spatial transcriptomics (ST) maps from digital pathology whole-slide images (WSIs), a task motivated by the high cost of acquiring ST data. It introduces M2OST, a many-to-one Transformer that jointly leverages multi-scale pathology patches through a decoupled encoder consisting of a deformable patch embedding (DPE), an intra-level token mixing module (ITMM), a cross-level token mixing module (CTMM), and a cross-level channel mixing module (CCMM). Extensive experiments, including ablation studies, on three public ST datasets show that the many-to-one, multi-scale design achieves state-of-the-art ST prediction performance with substantially fewer parameters and FLOPs, while effectively integrating inter-spot and multi-scale information. The approach offers a practical, end-to-end framework for cost-effective ST map generation that generalizes well and scales to different numbers of inputs. Overall, M2OST advances multi-scale computational pathology by efficiently fusing hierarchical image information to predict spatial gene expression with high fidelity.
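To make the idea of "decoupled" mixing concrete, the sketch below shows what mixing along each axis of a multi-level token tensor could look like. It is a toy illustration only, not M2OST's actual modules: it assumes every level has already been embedded into the same number of tokens N with C channels, giving a tensor of shape (L, N, C), and it uses plain linear (MLP-Mixer-style) mixing matrices in place of whatever attention or MLP blocks the paper uses. All function names, shapes, and the mean-pool regression head are hypothetical.

```python
import numpy as np

# Hypothetical toy sketch: NOT the M2OST implementation.
# x has shape (L, N, C): L pyramid levels, N tokens per level, C channels.

def intra_level_token_mixing(x, w_tok):
    """Mix the N tokens within each level independently; w_tok: (N, N)."""
    return np.einsum("mn,lnc->lmc", w_tok, x)

def cross_level_token_mixing(x, w_lvl):
    """At each token position, mix the L level-specific tokens; w_lvl: (L, L)."""
    return np.einsum("kl,lnc->knc", w_lvl, x)

def cross_level_channel_mixing(x, w_ch):
    """Concatenate channels of all levels per token, then project; w_ch: (L*C, C_out)."""
    L, N, C = x.shape
    stacked = x.transpose(1, 0, 2).reshape(N, L * C)  # stacked[n, l*C + c] = x[l, n, c]
    return stacked @ w_ch  # (N, C_out)

rng = np.random.default_rng(0)
L, N, C, G = 3, 16, 32, 250  # levels, tokens, channels, genes (illustrative sizes)
x = rng.standard_normal((L, N, C))

# Residual-style stacking of the two token-mixing steps (an assumption).
h = x + intra_level_token_mixing(x, rng.standard_normal((N, N)) / N)
h = h + cross_level_token_mixing(h, rng.standard_normal((L, L)) / L)
fused = cross_level_channel_mixing(h, rng.standard_normal((L * C, C)) / (L * C))

# Single spot-level prediction from all levels (many-to-one), via a linear head.
pred = fused.mean(axis=0) @ rng.standard_normal((C, G))
print(fused.shape, pred.shape)
```

The point of separating the three operations is that each one touches a single axis: token mixing never crosses levels, level mixing never crosses token positions, and channel mixing is where the per-level features are finally fused into one representation.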
Abstract
The advancement of Spatial Transcriptomics (ST) has enabled spatially resolved profiling of gene expression alongside histopathology images. Although ST data offer valuable insights into the tumor micro-environment, their acquisition remains expensive. Therefore, directly predicting ST expression from digital pathology images is desirable. Current methods usually adopt existing regression backbones together with patch sampling for this task. This approach ignores the inherent multi-scale information embedded in the pyramidal data structure of digital pathology images and discards the inter-spot visual information that is crucial for accurate gene expression prediction. To address these limitations, we propose M2OST, a many-to-one regression Transformer that accommodates the hierarchical structure of pathology images via a decoupled multi-scale feature extractor. Unlike traditional models trained on one-to-one image-label pairs, M2OST uses multiple images from different levels of the digital pathology image to jointly predict the gene expression of their common corresponding spot. Built upon this many-to-one scheme, M2OST can be easily scaled to fit different numbers of inputs, and its network structure inherently incorporates nearby inter-spot features, which enhances regression performance. We have tested M2OST on three public ST datasets, and the experimental results show that M2OST achieves state-of-the-art performance with fewer parameters and floating-point operations (FLOPs).
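The many-to-one pairing described above can be illustrated with a small data-preparation sketch: for one spot, crop patches of the same pixel size, centered on the spot, from several pyramid levels, and pair the whole stack with a single expression vector. This is an assumption-laden illustration, not the authors' pipeline: the pyramid here is built by simple 2x2 average pooling of a synthetic array, spot centers are given in level-0 pixel coordinates, and the helper names (`build_pyramid`, `crop_centered`, `many_to_one_sample`) and the 250-gene label are hypothetical. Because each coarser level covers a larger field of view at the same patch size, the upper-level patches naturally include tissue around and between neighboring spots.

```python
import numpy as np

def build_pyramid(img, n_levels):
    """Hypothetical pyramid: each level halves resolution via 2x2 average pooling."""
    levels = [img.astype(np.float32)]
    for _ in range(n_levels - 1):
        prev = levels[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2
        prev = prev[:h, :w]
        levels.append((prev[0::2, 0::2] + prev[1::2, 0::2]
                       + prev[0::2, 1::2] + prev[1::2, 1::2]) / 4)
    return levels

def crop_centered(image, cy, cx, size):
    """Crop a size x size window centered at (cy, cx), zero-padding at borders."""
    h, w = image.shape[:2]
    half = size // 2
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    y0, x0 = cy - half, cx - half
    ys, xs = max(y0, 0), max(x0, 0)
    ye, xe = min(y0 + size, h), min(x0 + size, w)
    out[ys - y0:ye - y0, xs - x0:xe - x0] = image[ys:ye, xs:xe]
    return out

def many_to_one_sample(pyramid, downsamples, spot_yx, patch_size, expression):
    """One training pair: patches from every level -> one spot expression vector."""
    cy0, cx0 = spot_yx  # spot center in level-0 pixel coordinates
    patches = []
    for level_img, ds in zip(pyramid, downsamples):
        cy, cx = int(round(cy0 / ds)), int(round(cx0 / ds))
        patches.append(crop_centered(level_img, cy, cx, patch_size))
    return np.stack(patches), expression

rng = np.random.default_rng(0)
slide = rng.random((512, 512, 3)).astype(np.float32)  # synthetic stand-in for a WSI
pyramid = build_pyramid(slide, 3)
downsamples = [1, 2, 4]
patches, y = many_to_one_sample(pyramid, downsamples, (256, 256), 64, np.zeros(250))
print(patches.shape)  # (3, 64, 64, 3): three levels, one shared label
```

A one-to-one baseline would instead keep only `patches[0]` for each label; the many-to-one sample simply grows along the first axis when more levels are added, which is what allows the number of inputs to change without altering the label format.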
