CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

Di Zhang, Zhangpeng Gong, Xiaobo Pang, Jiashuai Liu, Junbo Lu, Hao Cui, Jiusong Ge, Zhi Zeng, Kai Yi, Yinghua Li, Si Liu, Tingsong Yu, Haoran Wang, Mireia Crispin-Ortuzar, eimiao Yu, Chen Li, Zeyu Gao

TL;DR

This paper presents the Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. CARE achieves superior average performance across 33 downstream benchmarks spanning morphological classification, molecular prediction, and survival analysis, outperforming other foundation model baselines overall.

Abstract

Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative region as the ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Using only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks spanning morphological classification, molecular prediction, and survival analysis, outperforming other foundation model baselines overall.
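
The Figure 2 caption on this page states that molecular embeddings are aligned to CARE's WSI embedding via an InfoNCE loss. As a rough illustration of what such an objective computes, here is a minimal pure-Python sketch. The function names, the temperature value of 0.07, and the one-directional (slide-to-molecule) form are assumptions made for illustration; the paper's actual encoders, batch construction, and loss details are not reproduced here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(slide_embs, mol_embs, temperature=0.07):
    """One-directional InfoNCE over a batch of paired embeddings.

    slide_embs[i] and mol_embs[i] are assumed to come from the same case
    (the positive pair); every other molecular embedding in the batch acts
    as a negative. The temperature is a hypothetical default, not a value
    taken from the paper.
    """
    s = [l2_normalize(v) for v in slide_embs]
    m = [l2_normalize(v) for v in mol_embs]
    total = 0.0
    for i, si in enumerate(s):
        # Cosine similarity of slide i to every molecular embedding, scaled.
        logits = [sum(a * b for a, b in zip(si, mj)) / temperature for mj in m]
        # Numerically stable log-sum-exp for the softmax denominator.
        mx = max(logits)
        log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits))
        # Negative log-probability of the matching (positive) pair.
        total += log_denom - logits[i]
    return total / len(s)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the loss is close to zero when each slide embedding points the same way as its own molecular embedding, and large when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.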

Paper Structure (23 sections, 7 equations, 8 figures, 16 tables)

Figures (8)

  • Figure 1: CARE vs. conventional CPath chunking on WSIs. Panels (a) and (b) depict patch chunks and regular region chunks, respectively. They both impose artificial grids that split tissue, making them inefficient and semantically weak. Panel (c) shows our adaptive region chunks (CARE), which behave like word-level tokens, respect tissue boundaries, and capture meaningful texture, morphology, and cell layout.
  • Figure 2: CARE architecture and training pipeline. (a) CARE framework with iBOT-style self-supervised pretraining using a teacher–student setup. CARE comprises three modules: the Adaptive Region Generator (ARG), Adaptive Region Self-Attention (ARSA), and Semantic and Prior Fusion (SPF). ARG partitions WSIs into morphologically coherent regions. ARSA operates within each adaptive region to derive region-level features. SPF aggregates these features into a WSI embedding. (b) Cross-modal pretraining pipeline. RNA and protein encoders produce molecular embeddings, which are aligned to CARE’s WSI embedding via an InfoNCE loss.
  • Figure 3: (a) Adaptive Region Generator. Based on soft inclusion, each patch retains only its top-3 candidate subregions and masks out the rest. Cosine similarity is then computed between the patch and each unmasked candidate, and the patch is assigned to the highest-scoring subregion, yielding an adaptive repartition of patches. (b) Semantic and Prior Fusion. A lightweight module that aggregates adaptive region features into a slide-level embedding.
  • Figure 4: AUROC (or F1) results on 33 downstream tasks under a logistic regression (or a linear layer for survival analysis) setting. The best-performing results are outlined with black boxes.
  • Figure 5: Box plots of experimental results. Each box plot aggregates results from all morphology-classification and molecular-classification tasks. The horizontal line inside each box indicates the median for that model across tasks. The green circle denotes the mean.
  • ...and 3 more figures
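
The assignment step described in the Figure 3(a) caption (keep the top-3 candidate subregions per patch by soft inclusion, then pick the unmasked candidate with the highest cosine similarity) can be sketched as follows. This is only an illustrative reading of the caption: the function names, the representation of subregions as prototype vectors, and the form of the soft inclusion scores are assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def assign_patches(patch_feats, region_feats, inclusion, k=3):
    """Assign each patch to exactly one subregion.

    patch_feats:  patch feature vectors (hypothetical inputs)
    region_feats: one prototype vector per subregion (hypothetical)
    inclusion:    inclusion[i][j] is the soft inclusion score of patch i
                  in subregion j (how it is computed is not modelled here)
    k:            candidates kept per patch; the caption uses 3
    """
    assignments = []
    for feat, scores in zip(patch_feats, inclusion):
        # Keep the k subregions with the highest soft inclusion; the rest are masked.
        candidates = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        # Among the unmasked candidates, choose the most similar subregion.
        best = max(candidates, key=lambda j: cosine(feat, region_feats[j]))
        assignments.append(best)
    return assignments


patches = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
soft_inclusion = [
    [0.0, 0.5, 0.3, 0.2],  # subregion 0 is masked for patch 0
    [0.4, 0.3, 0.2, 0.1],  # subregion 3 is masked for patch 1
]
result = assign_patches(patches, regions, soft_inclusion)
print(result)  # → [2, 1]
```

Note that patch 0 is most similar to subregion 0, but that subregion falls outside its top-3 candidates and is masked, so the patch goes to subregion 2 instead. The masking therefore constrains where a patch may be reassigned, while cosine similarity decides among the remaining options.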