CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, Lin Yang
TL;DR
CPath-Omni is a unified multimodal foundation model for pathology that handles both patch-level and WSI-level analysis. It combines a pathology-focused CLIP (CPath-CLIP), a SlideParser-based WSI processor, and an LLM (Qwen-2.5), trained in four stages on 700k patch captions and tens of thousands of WSI instructions across 42 datasets. The model achieves state-of-the-art performance on 39 of 42 tasks, spanning patch VQA, classification, and captioning as well as WSI VQA and captioning, and shows strong zero-shot and few-shot capabilities and competitive WSI subtyping performance. The work demonstrates that a one-for-all pathology foundation model is feasible and valuable, with practical implications for clinical decision support and for standardizing multimodal pathology AI.
Abstract
The emergence of large multimodal models (LMMs) has brought significant advancements to pathology. Previous research has primarily focused on training patch-level and whole-slide image (WSI)-level models separately, which limits the integration of learned knowledge across patches and WSIs and results in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify patch- and WSI-level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments demonstrate that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models. Additionally, we develop CPath-CLIP, a specialized pathology CLIP-based visual processor for CPath-Omni. For the first time, it integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model, achieving SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni's ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field of foundation models in pathology.
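For readers unfamiliar with zero-shot evaluation of CLIP-style models, the general recipe is to embed an image and one text prompt per class into a shared space, then pick the class whose text embedding has the highest cosine similarity to the image embedding. The sketch below illustrates only that general recipe; it is not CPath-CLIP's code. The function names, class labels, and three-dimensional embeddings are hypothetical toy values standing in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, class_text_embs):
    """Return (best_label, scores) using cosine similarity.

    image_emb: embedding of one image patch (assumed to come from an image encoder).
    class_text_embs: dict mapping class label -> embedding of its text prompt
                     (assumed to come from a text encoder in the same space).
    """
    img = l2_normalize(image_emb)
    scores = {}
    for label, text_emb in class_text_embs.items():
        txt = l2_normalize(text_emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (hypothetical); real ones would have hundreds of dimensions.
class_text_embs = {
    "tumor": [0.9, 0.1, 0.0],
    "stroma": [0.1, 0.9, 0.0],
    "necrosis": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_emb, class_text_embs)
print(label)  # tumor
```

No class-specific training is involved: adding a new class only requires embedding one more text prompt, which is why this setup is called zero-shot.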
