Multi-Modal Foundation Models for Computational Pathology: A Survey

Dong Li, Guihong Wan, Xintao Wu, Xinyu Wu, Xiaohui Chen, Yi He, Christine G. Lian, Peter K. Sorger, Yevgeniy R. Semenov, Chen Zhao

TL;DR

The paper addresses how to scale and generalize AI for computational pathology by integrating histology with textual, knowledge-based, and molecular data. It surveys 32 multi-modal foundation models for computational pathology (MMFM4CPath) across three paradigms: vision-language, vision-knowledge graph, and vision-gene expression, with vision-language methods further split into non-LLM-based and LLM-based approaches. It also catalogs 28 pathology-specific datasets and presents a taxonomy of downstream tasks, training strategies, and evaluation approaches. The survey highlights opportunities for spatial-omics integration and standardized benchmarking to enable clinically actionable multi-modal pathology AI.

Abstract

Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (H&E) stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.

Paper Structure

This paper contains 12 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: A roadmap of multi-modal foundation models for computational pathology (MMFM4CPath).
  • Figure 2: (Left) Illustration of a whole-slide image and its corresponding tile images from H&E-stained tissue. (Right) The three primary types of multi-modal approaches in computational pathology.
  • Figure 3: A comprehensive taxonomy of MMFM4CPath, categorized according to evaluation tasks. Non-LLM-based vision-language, LLM-based vision-language, vision-knowledge graph, and vision-gene expression models are each highlighted in a different color.