HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology

Ziqiao Weng, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee AD Cooper, Weidong Cai, Bo Zhou

TL;DR

HiFusion tackles the challenge of predicting spatial gene expression from routine histopathology by integrating multiscale intra-spot morphology with contextual regional information. It introduces Hierarchical Intra-Spot Modeling (HISM) to capture tissue-, cellular-, and subcellular-scale patterns, coupled with Context-Aware Cross-Scale Fusion (CCF) that uses region-based queries in a cross-attention mechanism to selectively fuse intra-spot features with surrounding tissue cues. The model is trained with multi-level supervision and a feature alignment loss to enforce cross-scale semantic consistency. Across two public ST datasets and both 2D slide-wise and 3D sample-specific evaluations, HiFusion consistently outperforms state-of-the-art baselines, with strong performance on clinically relevant cancer-marker genes, while offering favorable computational efficiency for practical deployment.
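
To make the fusion step concrete, the following is a minimal sketch of residual cross-attention in which a single region-level query attends over intra-spot tokens. It is an illustration under stated assumptions, not the authors' implementation: it uses one attention head, omits the learned query/key/value projections, and the names `residual_cross_attention`, `region_query`, and `spot_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def residual_cross_attention(region_query, spot_tokens):
    """Fuse spot tokens into a region query with scaled dot-product attention.

    Assumption: identity projections and a single head, for illustration only.
    Returns the fused vector (query + attended context) and the attention weights.
    """
    dim = len(region_query)
    # Similarity between the region query and each spot token (the keys).
    scores = [dot(region_query, token) / math.sqrt(dim) for token in spot_tokens]
    weights = softmax(scores)
    # Weighted sum of the spot tokens (the values).
    attended = [
        sum(w * token[i] for w, token in zip(weights, spot_tokens))
        for i in range(dim)
    ]
    # Residual connection: keep the regional signal and add the attended spot features.
    fused = [q + a for q, a in zip(region_query, attended)]
    return fused, weights


fused, weights = residual_cross_attention(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
)
```

In this toy call, the spot token most similar to the region query receives the largest weight, which is the sense in which the query selects which intra-spot features to fuse.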

Abstract

Spatial transcriptomics (ST) bridges gene expression and tissue morphology but faces clinical adoption barriers due to technical complexity and prohibitive costs. While computational methods predict gene expression from H&E-stained whole-slide images (WSIs), existing approaches often fail to capture the intricate biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual information from surrounding tissue. To overcome these limitations, we propose HiFusion, a novel deep learning framework that integrates two complementary components. First, we introduce the Hierarchical Intra-Spot Modeling module that extracts fine-grained morphological representations through multi-resolution sub-patch decomposition, guided by a feature alignment loss to ensure semantic consistency across scales. Concurrently, we present the Context-aware Cross-scale Fusion module, which employs cross-attention to selectively incorporate biologically relevant regional context, thereby enhancing representational capacity. This architecture enables comprehensive modeling of both cellular-level features and tissue microenvironmental cues, which are essential for accurate gene expression prediction. Extensive experiments on two benchmark ST datasets demonstrate that HiFusion achieves state-of-the-art performance across both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion's potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.
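
As a rough illustration of multi-resolution sub-patch decomposition, the sketch below splits one spot patch into progressively finer grids. The grid sizes `(1, 2, 4)` and the function names `split_into_grid` and `hierarchical_decompose` are assumptions made for this example; the abstract does not specify the resolutions or how each sub-patch is encoded.

```python
def split_into_grid(patch, n):
    """Split a 2D patch (list of rows) into an n x n grid of sub-patches.

    Sub-patches are returned in row-major order.
    """
    height, width = len(patch), len(patch[0])
    if height % n or width % n:
        raise ValueError("patch size must be divisible by the grid size")
    sub_h, sub_w = height // n, width // n
    return [
        [row[c * sub_w:(c + 1) * sub_w] for row in patch[r * sub_h:(r + 1) * sub_h]]
        for r in range(n)
        for c in range(n)
    ]


def hierarchical_decompose(patch, grid_sizes=(1, 2, 4)):
    """Map each grid size to its list of sub-patches (coarse to fine).

    Assumption: the grid sizes are illustrative, not taken from the paper.
    """
    return {n: split_into_grid(patch, n) for n in grid_sizes}


# Toy 4 x 4 "image" whose pixel values are 0..15.
toy_patch = [[r * 4 + c for c in range(4)] for r in range(4)]
levels = hierarchical_decompose(toy_patch)
```

Each level would then be passed through an image encoder to produce tokens at that scale; that step is omitted here because it depends on details not given in this summary.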

Paper Structure

This paper contains 26 sections, 10 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Schematic of the proposed HiFusion framework, which integrates Hierarchical Intra-Spot Modeling (HISM) and Context-Aware Cross-Scale Fusion (CCF). HISM hierarchically decomposes each spot into multi-scale patches to extract fine-grained features with semantic alignment. CCF fuses contextual region features with multi-scale spot representations via residual cross-attention for gene expression prediction.
  • Figure 2: Ablation study for (a) spot token number and (b) neighbor patch size.
  • Figure 3: Predicted spatial expression of ERBB2, KRT19, and CD74 by different models. MAE and PCC values relative to the ground truth are shown. Brighter regions indicate higher gene expression levels, while darker regions represent lower expression. HiFusion achieves the best visual and quantitative alignment.
  • Figure 4: Ablation study for (a) spot token number and (b) neighbor patch size.
  • Figure 5: Predicted spatial expression of ERBB2, KRT19, and TMSB10 on three representative samples from the HER2 dataset. HiFusion and HiFusion (3D) are compared with the ground truth. Brighter regions indicate higher gene expression. HiFusion (3D) shows better visual and quantitative alignment.
  • ...and 1 more figures
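
The figure captions report MAE and PCC between predicted and measured expression. For reference, here is a small sketch of how these two metrics are commonly computed for one gene across spots. The function names are hypothetical, and the paper's exact evaluation protocol (for example, any normalization or per-gene averaging) is not described in this summary.

```python
import math


def mean_absolute_error(pred, truth):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def pearson_correlation(pred, truth):
    """Pearson correlation coefficient (PCC) between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0 or var_t == 0:
        # PCC is undefined when either sequence is constant.
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


# Toy values for one gene across three spots (made up for illustration).
predicted = [1.0, 2.0, 3.0]
measured = [2.0, 4.0, 6.0]
mae = mean_absolute_error(predicted, measured)
pcc = pearson_correlation(predicted, measured)
```

The toy values show why both metrics are reported together: the prediction follows the measured trend exactly, so PCC is 1, yet it is off in scale, so MAE is 2.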