From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images

Ruikun Zhang, Yan Yang, Liyuan Pan

TL;DR

PixNet tackles the problem of spatial gene expression prediction by moving from spot-wise regression on fixed crops to dense, pixel-wise mapping from histology images. It builds a multi-scale pyramidal feature extractor and a U-Net–style decoder to produce a dense gene expression map $\mathbf{G}$, then aggregates values within circular ROIs to predict expression for arbitrary spots, all trained with sparse supervision. Across four ST datasets and multiple scales, PixNet achieves state-of-the-art PCC-based metrics and demonstrates robust cross-scale generalization (e.g., from $100\,\mu\text{m}$ training to $2\,\mu\text{m}$ testing), while ablations highlight the importance of the SAFB module, joint loss, and a foundation encoder like UNI2. This approach facilitates accurate, scalable spatial transcriptomics analyses directly from standard histology images and has potential to enhance downstream tissue molecular profiling and clinical interpretation; the authors also plan to release the source code publicly.
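
To make the "PCC-based metrics" concrete, here is a minimal NumPy sketch of a per-gene Pearson correlation coefficient computed across spots. The function name `per_gene_pcc` and the `(num_spots, num_genes)` layout are assumptions made for illustration; the exact PCC variants reported in the paper are not specified in this summary.

```python
import numpy as np

def per_gene_pcc(pred, target, eps=1e-12):
    """Pearson correlation per gene, computed across spots.

    pred, target: arrays of shape (num_spots, num_genes).
    Returns an array of shape (num_genes,).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Center each gene (column) over the spot axis.
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    cov = (pc * tc).sum(axis=0)
    denom = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0))
    # eps guards against division by zero for constant genes.
    return cov / np.maximum(denom, eps)

# Hypothetical data: 50 spots, 3 genes.
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 3))
pred = 2.0 * target + 1.0  # a perfect linear relation per gene
pcc = per_gene_pcc(pred, target)
```

Because PCC is invariant to positive scaling and shifting, it rewards predictions that track the relative spatial pattern of a gene rather than its absolute magnitude.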

Abstract

Spatial transcriptomics (ST) measures gene expression at fine-grained spatial resolution, offering insights into tissue molecular landscapes. Previous methods for spatial gene expression prediction typically crop spots of interest from histopathology slide images, and train models to map each spot to a corresponding gene expression profile. However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. Different from previous methods that map individual spots to gene expression values, we generate a spatially dense continuous gene expression map from the histopathology slide image, and aggregate values within spots of interest to predict the gene expression. Our PixNet outperforms state-of-the-art methods on four common ST datasets in multiple spatial scales. The source code will be publicly available.
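
The aggregation step described above can be sketched as follows: given a dense map with one channel per gene, pool the values that fall inside a circular spot. This is only an illustration under stated assumptions. The function name `aggregate_spot`, the `(H, W, C)` layout, pixel-unit coordinates, and the choice between sum and mean pooling are all hypothetical; the paper's own aggregation equation is not reproduced in this summary.

```python
import numpy as np

def aggregate_spot(G, center_xy, radius, reduce="sum"):
    """Pool a dense gene expression map inside one circular spot.

    G:         array of shape (H, W, C), one channel per gene.
    center_xy: (x, y) spot center in pixel coordinates.
    radius:    spot radius in pixels.
    reduce:    "sum" or "mean" (assumed options, not from the paper).
    """
    H, W, C = G.shape
    cx, cy = center_xy
    ys, xs = np.ogrid[:H, :W]
    # Boolean mask of pixels whose centers lie within the circle.
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    values = G[mask]  # shape (num_pixels_in_spot, C)
    if values.shape[0] == 0:
        return np.zeros(C, dtype=G.dtype)
    return values.sum(axis=0) if reduce == "sum" else values.mean(axis=0)

# Hypothetical example: a constant map with 2 genes.
G = np.ones((8, 8, 2))
small = aggregate_spot(G, center_xy=(4, 4), radius=1)  # 5 pixels in the disk
large = aggregate_spot(G, center_xy=(4, 4), radius=2)  # 13 pixels in the disk
```

Because the spot position and radius are inputs rather than fixed crop sizes, the same dense map can be queried at different spot scales, which is the property the abstract emphasizes.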

Paper Structure

This paper contains 24 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of fields. (a) Existing approaches treat spatial gene expression prediction as a regression problem, training various networks on fixed crops from a slide image. (b) Our method formulates it as a dense prediction task, generating a gene expression map and aggregating values within spots of interest.
  • Figure 2: Overview of our framework. We extract a pyramidal feature map $\{\mathbf{F}_{l}\}_{l=1}^{L}$ from a slide image $\mathbf{I}$ and progressively decode it into a gene expression map $\mathbf{G}$ using multiple separable attention fusion blocks (SAFBs). The predicted gene expression values $\{\hat{y}_{n}\}_{n=1}^{N}$ are aggregated (Eq. \ref{eq_sum}) from $\mathbf{G}$ based on the positions and radii of the spots of interest. During training, the loss $\mathcal{L}$ is computed sparsely, only on spots with ground truth gene expression $\{y_{n}\}_{n=1}^{N}$.
  • Figure 3: Examples of predicted expression of cancer-related genes \cite{stnet, App_0, App_1, XBP1_0, XBP1_1, FASN_0, FASN_1}. From left to right, we show the slide image, the ground truth gene expression, and predictions from various methods for regions cropped from the colored boxes in the slide image.
  • Figure 4: Ablation study of the pretrained image encoder \cite{ResNet_pretrain, clip, UNI, zimmermann2024virchow2, hoptimus0}.
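
The sparse supervision described in the Figure 2 caption can be illustrated with a small NumPy sketch: the loss is evaluated only at the $N$ labeled spots, so pixels of $\mathbf{G}$ outside every spot receive zero gradient. Everything here is an assumption for illustration. Sum aggregation and a plain mean squared error stand in for the paper's aggregation equation and joint loss, which this summary does not spell out, and the helper names `spot_masks` and `sparse_mse_and_grad` are hypothetical.

```python
import numpy as np

def spot_masks(shape_hw, centers_xy, radii):
    """Boolean masks of shape (N, H, W), one circular mask per spot."""
    H, W = shape_hw
    ys, xs = np.ogrid[:H, :W]
    return np.stack([(xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
                     for (cx, cy), r in zip(centers_xy, radii)])

def sparse_mse_and_grad(G, masks, y):
    """MSE between sum-aggregated predictions and spot labels (assumed loss).

    G: (H, W, C) dense map, masks: (N, H, W), y: (N, C) spot labels.
    Returns the scalar loss and its gradient with respect to G.
    """
    m = masks.astype(G.dtype)
    y_hat = np.einsum("nhw,hwc->nc", m, G)  # sum inside each spot
    diff = y_hat - y
    loss = (diff ** 2).mean()
    # Chain rule: each pixel collects the error of every spot covering it.
    grad = np.einsum("nhw,nc->hwc", m, 2.0 * diff / diff.size)
    return loss, grad

# Hypothetical example: a 16x16 map with 3 genes and two labeled spots.
rng = np.random.default_rng(0)
G = rng.normal(size=(16, 16, 3))
masks = spot_masks((16, 16), centers_xy=[(4, 4), (11, 10)], radii=[2, 3])
y = np.zeros((2, 3))
loss, grad = sparse_mse_and_grad(G, masks, y)
covered = masks.any(axis=0)  # pixels inside at least one spot
```

In this sketch `grad[~covered]` is exactly zero, which shows why a dense output can be trained from spot-level labels alone: only the pixels that contribute to a labeled spot are updated directly.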