HiFusion: Hierarchical Intra-Spot Alignment and Regional Context Fusion for Spatial Gene Expression Prediction from Histopathology
Ziqiao Weng, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee AD Cooper, Weidong Cai, Bo Zhou
TL;DR
HiFusion tackles the challenge of predicting spatial gene expression from routine histopathology by integrating multiscale intra-spot morphology with regional context. It introduces Hierarchical Intra-Spot Modeling (HISM) to capture tissue-, cellular-, and subcellular-scale patterns, coupled with Context-aware Cross-scale Fusion (CCF), which uses region-based queries in a cross-attention mechanism to selectively fuse intra-spot features with surrounding tissue cues. The model is trained with multi-level supervision and a feature alignment loss that enforces cross-scale semantic consistency. On two public spatial transcriptomics (ST) datasets, in both 2D slide-wise and 3D sample-specific evaluations, HiFusion consistently outperforms state-of-the-art baselines and performs strongly on clinically relevant cancer-marker genes, while offering favorable computational efficiency for practical deployment.
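To make the fusion step concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention in which one regional-context vector acts as the query and a set of intra-spot feature vectors act as keys and values. It is an illustration under stated assumptions, not the authors' implementation: the names `cross_attention`, `region_query`, and `intra_spot_tokens` are hypothetical, the toy vectors are made up, and the learned query/key/value projections that a real model would use are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product cross-attention for one query vector.

    Assumption: no learned projections; query, keys, and values are used as-is.
    Returns the fused vector and the attention weights over the keys.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    fused = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return fused, weights


# Hypothetical toy features: one regional query and three intra-spot tokens.
region_query = [1.0, 0.0]
intra_spot_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

fused, weights = cross_attention(region_query, intra_spot_tokens, intra_spot_tokens)
```

In this toy setup the intra-spot token most similar to the regional query receives the largest weight, which is the sense in which the fusion is selective.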
Abstract
Spatial transcriptomics (ST) bridges gene expression and tissue morphology but faces clinical adoption barriers due to technical complexity and prohibitive costs. While computational methods predict gene expression from H&E-stained whole-slide images (WSIs), existing approaches often fail to capture the intricate biological heterogeneity within spots and are susceptible to morphological noise when integrating contextual information from surrounding tissue. To overcome these limitations, we propose HiFusion, a novel deep learning framework that integrates two complementary components. First, we introduce the Hierarchical Intra-Spot Modeling (HISM) module, which extracts fine-grained morphological representations through multi-resolution sub-patch decomposition, guided by a feature alignment loss to ensure semantic consistency across scales. Second, we present the Context-aware Cross-scale Fusion (CCF) module, which employs cross-attention to selectively incorporate biologically relevant regional context, thereby enhancing representational capacity. This architecture enables comprehensive modeling of both cellular-level features and tissue microenvironmental cues, which are essential for accurate gene expression prediction. Extensive experiments on two benchmark ST datasets demonstrate that HiFusion achieves state-of-the-art performance in both 2D slide-wise cross-validation and more challenging 3D sample-specific scenarios. These results underscore HiFusion's potential as a robust, accurate, and scalable solution for ST inference from routine histopathology.
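As a rough illustration of multi-resolution sub-patch decomposition and a cross-scale alignment term, the sketch below splits a toy square patch into progressively finer grids and defines a cosine-distance loss between two feature vectors. Everything here is an assumption made for illustration: the abstract does not specify the grid sizes, the feature extractor, or the form of the alignment loss, so the 1x1, 2x2, and 4x4 grids, the function names `split_into_subpatches` and `cosine_alignment_loss`, and the choice of cosine distance are hypothetical stand-ins.

```python
import math


def split_into_subpatches(patch, grid):
    """Split a square 2D patch (list of rows) into grid x grid equal sub-patches.

    Sub-patches are returned in row-major order.
    """
    size = len(patch)
    if size % grid != 0:
        raise ValueError("patch size must be divisible by grid")
    step = size // grid
    return [
        [row[c:c + step] for row in patch[r:r + step]]
        for r in range(0, size, step)
        for c in range(0, size, step)
    ]


def cosine_alignment_loss(a, b):
    """1 - cosine similarity between two feature vectors.

    Assumed stand-in for a feature alignment loss; not the paper's definition.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Toy 4x4 "spot patch" decomposed at three assumed scales.
patch = [[r * 4 + c for c in range(4)] for r in range(4)]
pyramid = {g: split_into_subpatches(patch, g) for g in (1, 2, 4)}
```

In a full pipeline, each sub-patch would be passed through an image encoder, and a loss of this kind could be applied between pooled fine-scale features and the coarse-scale feature of the same spot so that the scales agree semantically.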
