PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration

Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Zhongyi Shui, Kai Zhang, Jingxiong Li, Xingheng Lyu, Tao Lin, Lin Yang

TL;DR

PathGen-1.6M presents a scalable, multi-agent data-generation framework that creates 1.6 million high-quality pathology image-text pairs from TCGA WSIs to train pathology-specific CLIP backbones (PathGen-CLIP and PathGen-CLIP-L) and an LLM-aligned model (PathGen-LLaVA). The approach yields substantial gains in zero-shot and few-shot image classification, as well as whole-slide image analysis, and demonstrates strong integration with LLMs that surpasses existing pathology LMMs and even some general models on PathMMU. By leveraging high-resolution WSIs and a cascaded agent architecture (patch extraction, captioning, revision, and summarization), PathGen provides a scalable pathway to robust general pathology models with multimodal capabilities. The work also discusses limitations, notably dependence on WSI reports, and outlines avenues for reducing this reliance while expanding data coverage and capabilities.

Abstract

Vision Language Models (VLMs) like CLIP have attracted substantial attention in pathology, serving as backbones for applications such as zero-shot image classification and Whole Slide Image (WSI) analysis. Additionally, they can function as vision encoders when combined with large language models (LLMs) to support broader capabilities. Current efforts to train pathology VLMs rely on pathology image-text pairs from platforms like PubMed, YouTube, and Twitter, which provide limited, unscalable data with generally suboptimal image quality. In this work, we leverage large-scale WSI datasets like TCGA to extract numerous high-quality image patches. We then train a large multimodal model to generate captions for these images, creating PathGen-1.6M, a dataset containing 1.6 million high-quality image-caption pairs. Our approach uses multiple agent models that collaborate to extract representative WSI patches and then generate and refine captions, yielding high-quality image-text pairs. Extensive experiments show that integrating these generated pairs with existing datasets to train a pathology-specific CLIP model, PathGen-CLIP, significantly enhances its ability to analyze pathological images, with substantial improvements across nine pathology-related zero-shot image classification tasks and three whole-slide image tasks. Furthermore, we construct 200K instruction-tuning samples based on PathGen-1.6M and integrate PathGen-CLIP with the Vicuna LLM to create more powerful multimodal models through instruction tuning. Overall, we provide a scalable pathway for high-quality data generation in pathology, paving the way for next-generation general pathology models.
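
To make the zero-shot classification setting mentioned in the abstract concrete, the following is a minimal sketch of how a CLIP-style model is typically used for it: each class name is turned into a text prompt, both images and prompts are embedded, and each image is assigned the class whose prompt embedding is most similar. This is not the authors' evaluation code. The embeddings below are toy stand-ins for encoder outputs, and the class names, the `temperature` value, and the function names are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_classify(image_emb: np.ndarray, text_emb: np.ndarray,
                       temperature: float = 0.01):
    """Return (predicted class index, class probabilities) for each image.

    image_emb: (n_images, dim) image embeddings.
    text_emb:  (n_classes, dim) embeddings of one text prompt per class.
    temperature: assumed softmax temperature, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # cosine similarity, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs


# Toy stand-ins: in practice these would come from the image and text encoders.
class_names = ["tumor", "stroma", "necrosis"]     # hypothetical labels
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
image_emb = np.array([[0.9, 0.1, 0.0],
                      [0.1, 0.2, 0.8]])

pred, probs = zero_shot_classify(image_emb, text_emb)
predicted_labels = [class_names[i] for i in pred]
print(predicted_labels)
```

No labeled training images are involved: swapping in a different set of class prompts changes the task without retraining, which is why the quality of the image-text pairs used for pretraining matters so much in this setting.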

Paper Structure (13 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Illustration of the scale of the PathGen dataset (left), the performance of the proposed PathGen-CLIP (middle), and the PathGen-LLaVA (right), both derived from training on PathGen.
  • Figure 2: Illustration of the multi-agent collaboration pipeline for generating pathology image-text pairs. This process comprises two main components: (1) Representative Patches Extraction, which utilizes prompt-based cross-modal retrieval and clustering; and (2) Description Generation, where multiple LMM and LLM agents are employed to generate, revise, and summarize descriptions.
  • Figure 3: Comparison of few-shot classification using linear probing with different CLIP models on various pathology image classification datasets with accuracy (%).
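
The Figure 2 caption describes representative patch extraction as prompt-based cross-modal retrieval followed by clustering. The sketch below shows one plausible reading of that step and should not be taken as the authors' implementation: patches are ranked by cosine similarity to a set of prompt embeddings, the top `k` are kept, the kept patches are clustered with a simple k-means, and the patch nearest each centroid is returned. The scoring rule (maximum over prompts), the use of k-means, the values of `k` and `n_clusters`, and all function names are assumptions. The random arrays stand in for real patch and prompt embeddings.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve_top_k(patch_emb: np.ndarray, prompt_emb: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k patches most similar to any prompt (cosine similarity)."""
    sims = l2_normalize(patch_emb) @ l2_normalize(prompt_emb).T
    best = sims.max(axis=1)          # assumed rule: best match over all prompts
    return np.argsort(-best)[:k]


def assign(x: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Label each row of x with the index of its nearest centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(x: np.ndarray, n_clusters: int, n_iter: int = 20, seed: int = 0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for c in range(n_clusters):
            members = x[labels == c]
            if len(members) > 0:     # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return assign(x, centroids), centroids


def select_representatives(patch_emb, prompt_emb, k, n_clusters, seed=0):
    """Retrieve the top-k patches, cluster them, keep one patch per cluster."""
    kept = retrieve_top_k(patch_emb, prompt_emb, k)
    x = l2_normalize(patch_emb[kept])
    labels, centroids = kmeans(x, n_clusters, seed=seed)
    reps = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        d = np.linalg.norm(x[members] - centroids[c], axis=1)
        reps.append(int(kept[members[d.argmin()]]))   # index into original patches
    return sorted(reps)


# Random stand-ins for the embeddings of 200 patches and 4 text prompts.
rng = np.random.default_rng(0)
patch_emb = rng.normal(size=(200, 16))
prompt_emb = rng.normal(size=(4, 16))

reps = select_representatives(patch_emb, prompt_emb, k=50, n_clusters=5)
print(reps)
```

Retrieval filters out patches that do not match any prompt, while clustering spreads the final selection across visually distinct groups so that near-duplicate patches are not all sent on to the captioning agents.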