RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency

Wentao Huang, Meilong Xu, Xiaoling Hu, Shahira Abousamra, Aniruddha Ganguly, Saarthak Kapse, Alisa Yurovsky, Prateek Prasanna, Tahsin Kurc, Joel Saltz, Michael L. Miller, Chao Chen

TL;DR

This work tackles the challenge of aligning spatial transcriptomics with histopathology by learning gene-guided image representations through cross-modal ranking. It introduces RankByGene, which combines a gene-image contrastive loss with a cross-modal ranking consistency loss and a self-supervised intra-modal distillation to achieve robust, scalable multi-scale alignment. Across seven public datasets, RankByGene yields superior performance in gene expression prediction, slide-level classification, and survival analysis, demonstrating stronger cross-modal alignment and resilience to noise and sparsity in ST data. The approach offers a practical foundation for multi-modal pathology, enabling more accurate prognostic and diagnostic insights by leveraging gene-driven image representations in histopathology analyses.

Abstract

Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further enhance the alignment's stability, we employ self-supervised knowledge distillation with a teacher-student network architecture, effectively mitigating disruptions from high dimensionality, sparsity, and noise in gene expression data. Extensive experiments on seven public datasets that encompass gene expression prediction, slide-level classification, and survival analysis demonstrate the efficacy of our method, showing improved alignment and predictive performance over existing methods.
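
As an illustration of the gene-image contrastive term named in the TL;DR, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired spot features. The function names, the `temperature` value, and the symmetric averaging are assumptions made for illustration; the abstract does not specify the exact form used by the authors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def gene_image_contrastive_loss(image_feats, gene_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss (hypothetical helper, not the paper's code).

    image_feats[k] and gene_feats[k] are assumed to come from the same spot,
    so matched pairs sit on the diagonal of the similarity matrix.
    """
    img = [l2_normalize(v) for v in image_feats]
    gen = [l2_normalize(v) for v in gene_feats]
    # Cosine similarity between every image and every gene feature.
    logits = [
        [sum(a * b for a, b in zip(i, g)) / temperature for g in gen]
        for i in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # gene-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Matched pairs should score a lower loss than mismatched pairs.
aligned = gene_image_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[1.0, 0.0], [0.0, 1.0]])
shuffled = gene_image_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                       [[0.0, 1.0], [1.0, 0.0]])
```

A loss of this kind only pulls matched image and gene features together and pushes unmatched ones apart; it does not constrain how similar two different spots should be to each other, which is the gap the ranking-based loss is described as addressing.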

Paper Structure

This paper contains 29 sections, 11 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: t-SNE [van2008visualizing] visualization of image features for different spots in an ST slide. We show the features learned by different methods: (a) self-supervised learning (SSL) on natural images, (b) SSL on histopathology images, (c) contrastive learning (CL) on ST data using the InfoNCE loss [jaume2024hest], and (d) RankByGene. Learning with gene information (c) clearly outperforms learning with images alone ((a) and (b)). Our method (d) improves further, showing better inter-class separability than (c). Each spot is assigned a label via K-means clustering on its gene expression values; the corresponding image features are then color-coded by these labels. We also report the v-score [rosenberg2007v], where a higher value indicates better alignment of image features with gene expression.
  • Figure 2: Overview of our RankByGene framework. The framework begins with WSI tiling, where WSIs are cut into patches, each paired with a gene spot. In feature extraction, weakly and strongly augmented versions of each patch are processed by a teacher encoder and a student encoder, while a gene encoder extracts features from the gene profile. In the feature alignment stage, the weakly and strongly augmented image features are aligned through an intra-modal distillation loss, and the image and gene features are aligned using a gene-image contrastive loss. Meanwhile, our proposed cross-modal ranking consistency loss keeps the similarity ranking consistent across the two modalities.
  • Figure 3: Illustration of the ranking loss intuition when the gene feature $g_q$ is closer to $g_p$ than $g_r$ is. Note that similarity is inversely related to distance. Left: $i_q$ is also closer to $i_p$ than $i_r$ is, and the gap between the image feature similarities $S_{p,q}^I$ and $S_{p,r}^I$ is larger than the gap between the gene feature similarities $S_{p,q}^G$ and $S_{p,r}^G$. This is the desirable case, where $\ell(p,q,r)$ is negative. Middle: the similarity ranking is the same for gene and image features, but the gap between the image feature similarities is smaller than the gap between the gene feature similarities; $\ell(p,q,r)$ is positive and incurs a penalty. Right: the similarity order is inconsistent across the two modalities; $\ell(p,q,r)$ is positive, which is undesirable.
  • Figure 4: Comparison of gene-image distances from 100 randomly sampled spot pairs. Each point represents the gene and image distance between two spots. A higher $R^2$ [nagelkerke1991note] indicates a stronger linear correlation, suggesting better alignment between gene and image features.
  • Figure 5: Visualization of FASN gene expression predictions from different methods, with all values normalized to the range of 0 to 1.
  • ...and 4 more figures
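
The three cases described in the Figure 3 caption can be sketched in code. The caption gives only the sign of $\ell(p,q,r)$ in each case, so the concrete form below is an assumption chosen to reproduce those signs: $\ell(p,q,r)$ is taken as the gene similarity gap minus the image similarity gap, and a hinge $\max(0, \ell)$ turns it into a penalty. The cosine similarity, the function names, and the averaging over triplets are likewise illustrative rather than taken from the paper.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(feats):
    """Pairwise cosine similarities within one modality."""
    return [[cosine(a, b) for b in feats] for a in feats]


def triplet_term(s_gene, s_img, p, q, r):
    """Assumed form of l(p, q, r): gene similarity gap minus image similarity gap.

    Negative when the image gap exceeds the gene gap (Left case), positive
    when the image gap is smaller (Middle) or reversed in sign (Right).
    """
    gene_gap = s_gene[p][q] - s_gene[p][r]
    image_gap = s_img[p][q] - s_img[p][r]
    return gene_gap - image_gap


def ranking_consistency_loss(gene_feats, image_feats):
    """Mean hinge penalty over triplets where genes rank q above r for anchor p."""
    s_gene = similarity_matrix(gene_feats)
    s_img = similarity_matrix(image_feats)
    total, count = 0.0, 0
    for p, q, r in permutations(range(len(gene_feats)), 3):
        if s_gene[p][q] > s_gene[p][r]:
            total += max(0.0, triplet_term(s_gene, s_img, p, q, r))
            count += 1
    return total / count if count else 0.0


gene = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
consistent = ranking_consistency_loss(gene, gene)  # same geometry in both modalities
swapped = ranking_consistency_loss(gene, [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
```

Here `consistent` uses identical features in both modalities, so every gap matches and no triplet is penalized, while `swapped` exchanges the image features of the second and third spots and so violates the gene-derived ordering. Enumerating all triplets costs cubic time in the number of spots; a practical implementation would likely sample triplets within a batch.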