CrossFusion: A Multi-Scale Cross-Attention Convolutional Fusion Model for Cancer Survival Prediction

Rustin Soraki, Huayu Wang, Joann G. Elmore, Linda Shapiro

TL;DR

Cancer survival prediction from whole slide images is challenging because of their enormous size and tissue heterogeneity. CrossFusion introduces a multi-scale cross-attention framework that fuses patches from 5x, 10x, and 20x magnifications through a Cross-Attention Block, a Pad-Transformer, and a Conv Processor, producing a prediction token for prognosis; hazards h are derived from logits l via $\mathbf{h}=\sigma(\mathbf{l})$ and survival is $\mathbf{S}=\prod (1-\mathbf{h})$. The approach achieves state-of-the-art or near state-of-the-art performance across six TCGA cancer types, with interpretable heatmaps showing region-specific decisions and clear gains when using domain-specific feature backbones such as Uni2-h. These results demonstrate CrossFusion's potential to improve prognostication and support personalized cancer treatment, while maintaining interpretability and enabling future multimodal extensions. The publicly available code further facilitates reproducibility and adoption in clinical research.
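The relation between logits, hazards, and survival stated above can be illustrated with a minimal sketch. It assumes a discrete-time formulation in which each logit corresponds to one time bin and the survival value for a bin is the cumulative product of $(1-h)$ over all bins up to and including it; the summary does not spell out the product's index range, so this reading, the function name `survival_from_logits`, and the four-bin example are illustrative assumptions rather than the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function, mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def survival_from_logits(logits):
    """Return (hazards, survival) for a list of per-bin logits.

    Assumption: survival[k] is the product of (1 - hazards[j]) for j <= k,
    i.e. the probability of surviving past bin k.
    """
    hazards = [sigmoid(l) for l in logits]
    survival = []
    running = 1.0
    for h in hazards:
        running *= 1.0 - h  # survive this bin given survival so far
        survival.append(running)
    return hazards, survival


# Hypothetical example: four time bins, all logits zero (hazard 0.5 per bin).
hazards, survival = survival_from_logits([0.0, 0.0, 0.0, 0.0])
print(hazards)   # [0.5, 0.5, 0.5, 0.5]
print(survival)  # [0.5, 0.25, 0.125, 0.0625]
```

Because every hazard lies strictly between 0 and 1, the survival values are non-increasing across bins, which is the property a survival curve needs.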

Abstract

Cancer survival prediction from whole slide images (WSIs) is a challenging task in computational pathology due to the large size, irregular shape, and high granularity of the WSIs. These characteristics make it difficult to capture the full spectrum of patterns, from subtle cellular abnormalities to complex tissue interactions, which are crucial for accurate prognosis. To address this, we propose CrossFusion, a novel multi-scale feature integration framework that extracts and fuses information from patches across different magnification levels. By effectively modeling both scale-specific patterns and their interactions, CrossFusion generates a rich feature set that enhances survival prediction accuracy. We validate our approach across six cancer types from public datasets, demonstrating significant improvements over existing state-of-the-art methods. Moreover, when coupled with domain-specific feature extraction backbones, our method shows further gains in prognostic performance compared to general-purpose backbones. The source code is available at: https://github.com/RustinS/CrossFusion

Paper Structure

This paper contains 20 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of CrossFusion. WSIs are processed by extracting patches at 5x (coarse), 10x (source), and 20x (fine) magnifications, which are first encoded using a feature extractor and then projected into a common embedding space. The source features interact with the coarse and fine features via cross-attention blocks, and each branch is refined by Pad-Transformers. The multi-scale features are subsequently fused using a Conv Processor, and a replicated learnable class token is appended. An additional transformer block refines this token, and an MLP head produces the final survival predictions from the class tokens.
  • Figure 2: Kaplan-Meier curves of predicted high-risk (red) and low-risk (blue) groups. A P-value < 0.05 indicates statistical significance.
  • Figure 3: Heatmaps generated by the model for a case predicted as high-risk. The dark purple clusters mark tumor regions in the original WSI (left), and the lighter yellow areas highlight regions that the model's attention weights mark as important.
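The cross-attention step mentioned in the Figure 1 caption, where source (10x) features interact with coarse or fine features, can be sketched as plain scaled dot-product attention. This is a simplified illustration, not the authors' block: it assumes the source features act as queries and the other scale supplies keys and values, uses a single head, and omits the learned projections, normalization, and residual connections that a real block would likely include. The function names and the toy 2-dimensional features are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: list of source-scale feature vectors (assumed role)
    keys, values: feature vectors from another scale
    Returns one attended vector per query.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Toy example: 2 source patches attend over 3 fine-scale patches.
source_10x = [[1.0, 0.0], [0.0, 1.0]]
fine_20x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(source_10x, fine_20x, fine_20x)
print(len(attended), len(attended[0]))  # 2 2
```

The output keeps one vector per source patch, so the source branch retains its own sequence length while taking in context from the other scale.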