Knowledge-enhanced Pretraining for Vision-language Pathology Foundation Model on Cancer Diagnosis

Xiao Zhou, Luoyi Sun, Dexuan He, Wenbin Guan, Ge Wang, Ruifen Wang, Lifeng Wang, Xiaojun Yuan, Xin Sun, Ya Zhang, Kun Sun, Yanfeng Wang, Weidi Xie

TL;DR

This paper introduces KEEP (KnowledgE-Enhanced Pathology), a foundation model that systematically incorporates disease knowledge into pretraining for cancer diagnosis. Its results establish knowledge-enhanced vision-language modeling as a powerful paradigm for advancing computational pathology.

Abstract

Vision-language foundation models have shown great promise in computational pathology but remain primarily data-driven, lacking explicit integration of medical knowledge. We introduce KEEP (KnowledgE-Enhanced Pathology), a foundation model that systematically incorporates disease knowledge into pretraining for cancer diagnosis. KEEP leverages a comprehensive disease knowledge graph encompassing 11,454 diseases and 139,143 attributes to reorganize millions of pathology image-text pairs into 143,000 semantically structured groups aligned with disease ontology hierarchies. This knowledge-enhanced pretraining aligns visual and textual representations within hierarchical semantic spaces, enabling deeper understanding of disease relationships and morphological patterns. Across 18 public benchmarks (over 14,000 whole-slide images) and 4 institutional rare cancer datasets (926 cases), KEEP consistently outperformed existing foundation models, showing substantial gains for rare subtypes. These results establish knowledge-enhanced vision-language modeling as a powerful paradigm for advancing computational pathology.

Paper Structure

This paper contains 34 sections, 1 theorem, 22 equations, 12 figures.

Key Result

Theorem 1

Let $A_c(p)$ denote the proportion of tumor region in a pathological patch $p$, and $P(C|p)$ denote the probability that a classifier predicts $p$ as cancerous. Under a linear feature representation and sigmoid classifier with significant tumor feature contribution, $P(C|p)$ is positively correlated with $A_c(p)$.
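
A short worked derivation may help show why the tumor proportion $A_c(p)$ and the cancer probability $P(C|p)$, the two quantities paired in the theorem's title, would move together. The sketch below is not the paper's proof. It assumes one particular reading of "linear feature representation": the patch feature is an area-weighted mix of a tumor feature $\mathbf{f}_t$ and a normal feature $\mathbf{f}_n$. The symbols $\mathbf{f}_t$, $\mathbf{f}_n$, $\mathbf{w}$, $b$, and $z$ are introduced here for illustration only.

```latex
% Assumed linear mixing of tumor and normal features by tumor area proportion
\mathbf{f}(p) = A_c(p)\,\mathbf{f}_t + \bigl(1 - A_c(p)\bigr)\,\mathbf{f}_n

% Assumed sigmoid classifier on the mixed feature
P(C \mid p) = \sigma(z), \qquad
z = \mathbf{w}^{\top}\mathbf{f}(p) + b, \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

% Chain rule with respect to the tumor proportion
\frac{\partial P(C \mid p)}{\partial A_c(p)}
  = \sigma(z)\bigl(1 - \sigma(z)\bigr)\,\mathbf{w}^{\top}\bigl(\mathbf{f}_t - \mathbf{f}_n\bigr)
```

Since $\sigma(z)(1 - \sigma(z)) > 0$ for every finite $z$, the derivative is positive whenever $\mathbf{w}^{\top}(\mathbf{f}_t - \mathbf{f}_n) > 0$. Under this reading, that inequality is what "significant tumor feature contribution" would mean: the classifier scores the tumor feature higher than the normal one, so $P(C|p)$ increases monotonically with $A_c(p)$.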

Figures (12)

  • Figure 1: Overview of KEEP. A. Example disease structure in the constructed knowledge graph. Each node represents a disease, consisting of three attribute types: hierarchical relations, synonyms, and definitions, as indicated by the dashed line box. B. The knowledge encoding and vision-language alignment stage for the KEEP model. A BERT-based text encoder is initially trained to encode the disease knowledge through metric learning. A knowledge-enhanced vision-language pre-training approach is proposed to align pathology semantic groups with filtered images and augmented captions. C. For downstream cancer diagnostic tasks, including cancer region segmentation, cancer detection, and cancer subtyping, whole slide images (WSIs) are divided into tile images for zero-shot classification, with the results of each tile combined to determine the final diagnostic decision. The text prompt for zero-shot classification is [template + disease name], for instance, "A histopathology image of lung adenocarcinoma". D. The flowchart of cancer diagnosis, including WSI pre-processing and tiling, tile-level processing through the KEEP model, and mapping and aggregation of predictions. E. Performance comparison of cancer diagnosis with the state-of-the-art methods on 18 public benchmarks of more than 14,000 WSIs. The details of all datasets can be found in Table S1. F. Performance comparison of tile-level classification with the state-of-the-art methods on 14 benchmarks. The inner and outer numbers indicate the worst and best results, respectively. Also see Figure \ref{fig:supp_statictis} and Table S1.
  • Figure 2: Model architecture of KEEP. A. Disease knowledge encoding. We establish a knowledge graph that includes hypernym relations, synonyms, and definitions of diseases, and pretrain a disease knowledge encoder. Diseases at different levels are represented by different colors. B. Knowledge-guided dataset structuring. We fine-tune YOLOv8 to remove noise in the pathology image dataset, extract medical entities from the captions, align the diseases in the captions with the diseases and synonyms in the knowledge graph, and cluster the filtered image and text data into semantic groups. The right side illustrates two specific methods used during the clustering process. C. Knowledge-enhanced vision-language pretraining. We perform cropping and random dropping augmentations on the images and texts, and paraphrase captions that contain diseases using templates. During the training process, to mitigate the impact of false negatives, we design strategies for positive mining, hardest negative, and false negative elimination. Also see Figure \ref{fig:supp_disease_chain} and Table S2.
  • Figure 3: KEEP enhances slide-level cancer region segmentation. A. The scheme of zero-shot segmentation on WSIs, where individual tiles undergo binary classification and are then combined to delineate the cancerous region. B-C. Performance comparisons of AUPRC and DICE scores for various models, including PLIP [huang2023visual], QuiltNet [ikezogwo2024quilt], MI-Zero [lu2023visual], CONCH [lu2024visual], MUSK [xiang2025vision], and our proposed KEEP, across three WSI datasets: CAMELYON16 [bejnordi2017camelyon16] (48 WSIs), PANDA [bulten2022panda] (10,494 WSIs), and AGGC22 [huo2024aggc22] (128 WSIs). The DICE score is calculated using the average threshold corresponding to the optimal cutoff point of ROC curves in each dataset. “Before” and “after” represent the segmentation results before and after post-processing with a morphological opening operation, which removes small noisy regions while preserving the shape of larger structures. The box plots present the median, first, and third quartiles of results. The paired t-test is used to assess the statistical significance between the performance distributions of different models. ** denotes $P < 0.01$, and *** denotes $P < 0.001$. D-E. Performance comparisons of AUPRC and DICE scores between KEEP and text-based segmentation models, including MedSAM [ma2024medsam], BiomedParse [zhao2025biomedparse], and PathSeg [chen2025pathseg]. The box plots present the median, first, and third quartiles of results, with $\mu$ indicating the average performance. The paired t-test is used to assess the statistical significance between the performance distributions of different models. ** denotes $P < 0.01$, and *** denotes $P < 0.001$. F. Exemplary segmentation results from three datasets (the first two for CAMELYON16, the middle two for PANDA, and the last two for AGGC22) before and after post-processing. The number in the top-left of each result image indicates the DICE score. Also see Figure \ref{fig:supp_seg} and Table S3.
  • Figure 4: KEEP enhances slide-level cancer detection. A. The zero-shot cancer detection scheme on WSIs, where individual tiles undergo binary classification. The probability of a slide being cancerous is determined by the predicted tumor ratio, calculated as the ratio of tumor tiles to all valid tiles. B. The comparison of the predicted tumor ratio between normal and cancer WSIs in the CPTAC-CM and CPTAC-CCRCC datasets. A two-sided Welch’s t-test is used to assess the statistical significance of predicted tumor ratios among different WSIs. C-I. Comparison of ROC curves across different models, including CHIEF [wang2024pathology], PLIP [huang2023visual], QuiltNet [ikezogwo2024quilt], MI-Zero [lu2023visual], CONCH [lu2024visual], MUSK [xiang2025vision], and KEEP, evaluated on 7 CPTAC datasets across 6 tissue anatomies: skin, kidney, pancreas, uterine, lung, and head and neck. Each dataset consists of 75 normal WSIs and 75 cancer WSIs, with each experiment using 1,000 bootstrap iterations. The AUROC for each model is reported as the median along with its 95% confidence intervals (CIs). J. Comparison of average sensitivities across all datasets at a specificity of 0.95; the error bars denote the standard deviation of the performance. K. Comparison of ROC curves across different models, evaluated on a rare cancer dataset consisting of 59 nephroblastoma WSIs and 51 normal WSIs. L. The average AUROC performance with standard deviation of different foundation models on 8 cancer detection datasets in few-shot (1, 2, 4, 8) and 5-fold cross-validation settings. M. Example visualizations of cancer detection on the CPTAC-CM and CPTAC-UCEC datasets. The first and the second rows denote the normal and the cancer WSIs. The heat map is generated from the similarities between the embeddings of tile images and those of "tumor" prompts. Also see Figure \ref{fig:supp_det} and Table S4.
  • Figure 5: KEEP enhances slide-level cancer subtyping. A. The zero-shot cancer subtyping scheme on WSIs, where individual tiles undergo multi-class classification, including a "normal" label and tumor subtype labels. The probability of a slide being classified as type I is determined by the ratio of type I tiles to all valid tiles. B. Comparison of average balanced accuracy with standard deviation across different models on seven datasets with common cancer subtypes, with each experiment using 1,000 bootstrap iterations. C. Performance comparison of different models on the rare cancer subtyping dataset, EBRAINS, which consists of 30 rare brain cancer subtypes, each with 30 WSIs. D. The confusion matrix of the KEEP model on the rare brain cancer dataset, EBRAINS. E. Ablation results on WSI tasks. Performance comparison between naïve contrastive pretraining and knowledge-enhanced pretraining (KEEP). F. Ablation results comparing naïve contrastive pretraining with the Top-100 pooling strategy (Contrast-Top100), KEEP with the Top-100 pooling strategy (KEEP-Top100), and KEEP with the tumor-ratio strategy (KEEP-Ratio). G. The average BACC performance with standard deviation of different foundation models on 7 common cancer subtyping datasets in few-shot (1, 2, 4, 8) and 5-fold cross-validation settings. H. Example WSIs for tumor subtyping. The left and the right WSIs denote esophagus adenocarcinoma and esophagus squamous cell carcinoma, respectively. The orange and the green masks denote the predicted regions of adenocarcinoma and squamous cell carcinoma, respectively. The blue squares denote the tile images from areas with normal predictions. Also see Figure \ref{fig:supp_sub} and Table S5.
  • ...and 7 more figures
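
The captions for Figures 4A and 5A describe slide-level decisions built from tile-level zero-shot predictions: each tile is assigned to a text prompt, and the slide score is the fraction of valid tiles assigned to a class. The sketch below illustrates that aggregation step only. It is not the KEEP implementation. It assumes tile and prompt embeddings have already been computed by some vision and text encoder, that tiles are assigned by highest cosine similarity, and that the slide-level subtype is the tumor class with the largest tile ratio. All function names (`classify_tiles`, `class_ratios`, `tumor_ratio`, `subtype_slide`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def classify_tiles(tile_emb, prompt_emb):
    """Assign each tile to the prompt with the highest cosine similarity.

    tile_emb:   (num_tiles, dim) array of precomputed tile embeddings.
    prompt_emb: (num_classes, dim) array of precomputed prompt embeddings.
    Returns an integer label per tile.
    """
    sims = l2_normalize(tile_emb) @ l2_normalize(prompt_emb).T
    return sims.argmax(axis=1)


def class_ratios(tile_labels, num_classes):
    """Fraction of valid tiles assigned to each class."""
    counts = np.bincount(tile_labels, minlength=num_classes)
    return counts / max(len(tile_labels), 1)


def tumor_ratio(tile_emb, prompt_emb):
    """Detection score for a slide.

    Assumes prompt_emb rows are ordered [normal, tumor]; the score is the
    ratio of tumor tiles to all valid tiles.
    """
    labels = classify_tiles(tile_emb, prompt_emb)
    return float(class_ratios(labels, 2)[1])


def subtype_slide(tile_emb, prompt_emb):
    """Subtyping decision for a slide.

    Assumes row 0 of prompt_emb is the "normal" prompt and rows 1..K are
    subtype prompts. Picking the subtype with the largest tile ratio is an
    assumption of this sketch.
    """
    labels = classify_tiles(tile_emb, prompt_emb)
    ratios = class_ratios(labels, prompt_emb.shape[0])
    predicted = int(ratios[1:].argmax()) + 1
    return predicted, ratios
```

Because both ratios are computed over all valid tiles, the "normal" class dilutes the scores of slides with small tumor regions but does not compete in the subtype decision, which considers only the tumor classes.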

Theorems & Definitions (2)

  • proof
  • Theorem 1: Positive Correlation of Tumor-ratio and Cancer Probability