PathAsst: A Generative Foundation AI Assistant Towards Artificial General Intelligence of Pathology

Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, Lin Yang

TL;DR

This work tackles the scarcity of high-quality pathology data for multimodal foundation models by constructing two specialized datasets, PathCap and PathInstruct. It couples a pathology-tuned CLIP (PathCLIP) with Vicuna-13b and uses a two-phase instruction-tuning regime to create PathAsst, a multimodal assistant capable of invoking eight pathology-specific sub-models and a paper-retrieval system for up-to-date context. Empirical results show that PathCLIP outperforms general-purpose CLIP and existing pathology baselines in image-text retrieval and zero-shot classification, while PathAsst outperforms prior MLLMs on pathology VQA tasks, especially when augmented with PathCLIP and model invocations. The integration of tool-enabled inference and literature retrieval shows practical potential for improving diagnostic accuracy and supporting pathology workflows in real-world settings.

Abstract

As advances in large language models (LLMs) and multimodal techniques continue to mature, the development of general-purpose multimodal large language models (MLLMs) has surged, offering significant applications in interpreting natural images. However, the field of pathology has largely remained untapped, particularly in gathering high-quality data and designing comprehensive model frameworks. To bridge the gap in pathology MLLMs, we present PathAsst, a multimodal generative foundation AI assistant to revolutionize diagnostic and predictive analytics in pathology. The development of PathAsst involves three pivotal steps: data acquisition, CLIP model adaptation, and the training of PathAsst's multimodal generative capabilities. First, we collect over 207K high-quality pathology image-text pairs from authoritative sources. Using ChatGPT, we generate over 180K instruction-following samples. We also devise additional instruction-following data specifically tailored for invoking the eight pathology-specific sub-models we prepared, allowing PathAsst to collaborate effectively with these models and enhancing its diagnostic ability. Second, using the collected data, we construct PathCLIP, a pathology-dedicated CLIP, to enhance PathAsst's capabilities in interpreting pathology images. Finally, we integrate PathCLIP with Vicuna-13b and use pathology-specific instruction-tuning data to enhance the multimodal generation capacity of PathAsst and bolster its synergistic interactions with the sub-models. The experimental results of PathAsst show the potential of harnessing AI-powered generative foundation models to improve pathology diagnosis and treatment processes.
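As a rough illustration of the CLIP-style zero-shot classification mentioned in the TL;DR, the sketch below scores an image embedding against one text embedding per candidate label and picks the most similar label. It is a minimal sketch under stated assumptions, not the authors' implementation: the embeddings are hand-made toy vectors rather than PathCLIP outputs, the label names are hypothetical, and the temperature of 0.07 is an assumed value chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, labels, temperature=0.07):
    """Return the label whose text embedding best matches the image embedding.

    temperature is an assumed value for this sketch, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    sims = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        sims.append(sum(a * b for a, b in zip(img, txt)))
    # Dividing by the temperature sharpens the distribution over labels.
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy, hand-made embeddings (hypothetical values, not model outputs).
labels = ["adenocarcinoma", "normal tissue"]
text_embs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
image_emb = [0.9, 0.1, 0.3]

predicted, probs = zero_shot_classify(image_emb, text_embs, labels)
print(predicted, [round(p, 4) for p in probs])
```

In a real pipeline, the image and text vectors would come from the image and text encoders of a CLIP-style model, with each label wrapped in a prompt before encoding. The same cosine-similarity scores, ranked instead of passed through a softmax, are also the basis for image-text retrieval.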

Paper Structure (20 sections, 1 equation, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Illustration of data processing: pathology image selection, sub-figure & caption separation, and refinement.
  • Figure 2: Examples of pathology-specific model-invoking instruction-following samples.
  • Figure 3: An illustration of the overall framework of PathAsst. The multimodal MLLM training encompasses the training processes of both PathCLIP and PathAsst, as well as the construction of a paper embedding database. The tool-augmented MLLM inference details the process of PathAsst utilizing various tools to enhance the quality of its generated outputs.
  • Figure 4: Comparative assessment of image retrieval performance between CLIP models across collected datasets.
  • Figure 5: Example of PathAsst calling a generation model.
  • ...and 2 more figures