CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology

Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, Lin Yang

TL;DR

CPath-Omni is a unified multimodal foundation model for pathology that jointly handles patch-level and WSI-level analysis. It integrates a pathology-focused CLIP (CPath-CLIP) with a SlideParser-based WSI processor and an LLM (Qwen-2.5) in a four-stage training regime, leveraging 700k patch captions and tens of thousands of WSI instructions across 42 datasets. The model achieves state-of-the-art performance on 39 of 42 datasets, spanning patch VQA, classification, and captioning as well as WSI VQA and captioning, and shows strong zero-shot/few-shot capabilities and competitive WSI subtyping performance. The work demonstrates the feasibility and value of a one-for-all pathology foundation model, with potential practical impact for clinical decision support and for standardizing multimodal pathology AI.

Abstract

The emergence of large multimodal models (LMMs) has brought significant advancements to pathology. Previous research has primarily focused on training patch-level and whole-slide image (WSI)-level models separately, which limits the integration of learned knowledge across patches and WSIs and results in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify patch-level and WSI-level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments demonstrate that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models trained for individual tasks. Additionally, we develop CPath-CLIP, a specialized pathology CLIP-based visual processor for CPath-Omni. For the first time, it integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model, achieving SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni's ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field of foundation models in pathology.

Paper Structure

This paper contains 24 sections, 19 figures, 15 tables.

Figures (19)

  • Figure 1: Overview of CPath-Omni’s ability to handle both patch-level and WSI analysis in clinical environments, such as microscope views and scanned WSIs, while supporting various tasks.
  • Figure 2: Overview of two key vision components of CPath-Omni: the patch-level model, CPath-CLIP, and the WSI model, SlideParser.
  • Figure 3: Comparison of few-shot classification accuracy (%) via linear probing across various datasets using different CLIP models.
  • Figure 4: Radar plot visualization of CPath-Omni’s performance on patch and WSI classification tasks: (a) patch-level performance under ID conditions, (b) patch classification performance under OOD/zero-shot conditions, and (c) whole-slide image (WSI) performance.
  • Figure B.1: Proportions of sub-datasets in CPath-PatchCaption and their primary sources.
  • ...and 14 more figures