
A Foundational Multimodal Vision Language AI Assistant for Human Pathology

Ming Y. Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Kenji Ikamura, Georg Gerber, Ivy Liang, Long Phi Le, Tong Ding, Anil V Parwani, Faisal Mahmood

TL;DR

PathChat is a vision-language generalist AI assistant for human pathology, built on an in-house foundational vision encoder pretrained on 100 million histology images from over 100,000 patient cases and 1.18 million pathology image-caption pairs.

Abstract

The field of computational pathology has witnessed remarkable progress in the development of both task-specific predictive models and task-agnostic self-supervised vision encoders. However, despite the explosive growth of generative artificial intelligence (AI), there has been limited study on building general purpose, multimodal AI assistants tailored to pathology. Here we present PathChat, a vision-language generalist AI assistant for human pathology using an in-house developed foundational vision encoder pretrained on 100 million histology images from over 100,000 patient cases and 1.18 million pathology image-caption pairs. The vision encoder is then combined with a pretrained large language model and the whole system is finetuned on over 250,000 diverse disease agnostic visual language instructions. We compare PathChat against several multimodal vision language AI assistants as well as GPT4V, which powers the commercially available multimodal general purpose AI assistant ChatGPT-4. When relevant clinical context is provided with the histology image, PathChat achieved a diagnostic accuracy of 87% on multiple-choice questions based on publicly available cases of diverse tissue origins and disease models. Additionally, using open-ended questions and human expert evaluation, we found that overall PathChat produced more accurate and pathologist-preferable responses to diverse queries related to pathology. As an interactive and general vision language AI assistant that can flexibly handle both visual and natural language inputs, PathChat can potentially find impactful applications in pathology education, research, and human-in-the-loop clinical decision making.

Paper Structure

This paper contains 1 section, 1 equation, 11 figures, 25 tables.

Figures (11)

  • Figure 1: Instruction-following dataset curation and PathChat overview. a. We curated what is currently the largest instruction finetuning dataset specialized for the domain of pathology, consisting of 257k instructions and corresponding responses covering varied formats (e.g. multi-turn conversations, multiple-choice questions, short answers; see Extended Data Figure 1 for complete examples) from diverse sources. b. To build an MLLM-based vision language AI assistant that can reason over visual and natural language inputs, we begin with a SOTA vision-only self-supervised pretrained foundation encoder model, UNI, and perform further vision language pretraining analogous to CONCH. The resulting vision encoder, CONCH-Large, is subsequently connected to a 13 billion parameter, pretrained LLM via a multimodal projector module (not shown) to form the complete MLLM architecture. The MLLM is finetuned on the curated instruction-following dataset to build PathChat, a visual language AI assistant specialized for human pathology. More details about data curation and model training can be found in the PathChat dataset curation and PathChat model design and training sections of Methods, respectively.
  • Figure 2: Multiple choice evaluation of PathChat. a. Illustrative example of a multiple-choice style diagnostic question. The input always includes a salient histology image ROI selected by a board-certified anatomic pathologist and the instruction to select the most likely diagnosis from a set of possible choices. In the image + clinical context evaluation setting that is designed to more closely mimic a real-world diagnostic workflow, additional relevant clinical context (designed by the pathologist, shown in blue) is provided together with the histology image and concatenated with the original instruction. b. Accuracy of MLLMs on multiple choice-style diagnostic questions. Note that we only compare against GPT4V on questions based on publicly available cases (PathQABench-Public). c. Accuracy of MLLMs on open-ended questions. b, c. Error bars represent 95% confidence intervals. d. Accuracy on different categories of questions. e. Head-to-head records on open-ended questions for PathChat vs. other MLLMs. Lose: said model is ranked higher than PathChat; Tie: PathChat is tied with the model in ranking; Win: PathChat is ranked higher than the model.
  • Figure 3: Exploring additional use cases of PathChat. Beyond evaluating PathChat on multiple choice-style questions and single turn open-ended question answering, we explore additional use cases and demonstrate examples that involve follow-up questions from users in the form of interactive, multi-turn conversations. a. PathChat can describe tumor tissue and cell morphology, infer the diagnosis, and correctly suggest potential IHC findings grounded in relevant background knowledge about the suspected malignancy. b. PathChat can summarize key morphological features in the histology image and, based on additional clinical context, can reasonably infer the primary origin of the tumor. c. PathChat understands and can attempt to follow well-known guidelines on tumor grading, in this case, the Gleason grading system for prostate adenocarcinoma. d. PathChat is familiar with different cell markers and can potentially help guide IHC interpretations. e. PathChat can potentially be consulted to perform human-in-the-loop differential diagnosis that may require multiple rounds of IHC workup.
  • Extended Figure 1: Examples of instructions for finetuning MLLM. One example is illustrated for each of the six types of instructions used to develop PathChat via instruction finetuning. Bolded texts represent instructions provided to the model while italicized texts represent the reference outputs the model is expected to produce during training. More details on dataset curation are provided in the PathChat dataset curation section of Methods.
  • Extended Figure 2: Comparing model outputs on open-ended question answering, example 1. An example question in PathQABench-Public, for which the response by PathChat is ranked higher (considered preferable by the expert pathologist) than those of the other models, as it clearly and correctly addresses the query asking for the most probable diagnosis and provides a reasonable description of the image. The other models give the same incorrect diagnosis of glioblastoma multiforme with outdated terminology. For this example, an expert pathologist ranked the PathChat output first, followed by the other three models ranked equally.
  • ...and 6 more figures
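
The head-to-head records defined in the Figure 2e caption (Win, Tie, Lose relative to PathChat) can be tallied from per-question expert rankings. The following is a minimal sketch under assumptions, not the authors' evaluation code: the function name `head_to_head`, the input layout (one dict of model name to rank per question, with 1 as the best rank and equal ranks as ties), and the placeholder model names are all hypothetical.

```python
from collections import Counter


def head_to_head(rankings, reference="PathChat"):
    """Tally Win/Tie/Lose for `reference` against every other model.

    rankings: list of dicts, one per question, mapping model name -> rank.
    Assumption: rank 1 is best and equal ranks count as a tie.
    """
    records = {}
    for ranks in rankings:
        ref_rank = ranks[reference]
        for model, rank in ranks.items():
            if model == reference:
                continue
            counts = records.setdefault(model, Counter())
            if ref_rank < rank:
                counts["Win"] += 1    # reference is ranked higher than the model
            elif ref_rank == rank:
                counts["Tie"] += 1    # reference is tied with the model
            else:
                counts["Lose"] += 1   # the model is ranked higher than reference
    return records


# Hypothetical rankings for two questions; the first mirrors the pattern in
# Extended Figure 2 (PathChat first, the other three models ranked equally).
example = [
    {"PathChat": 1, "Model A": 2, "Model B": 2, "Model C": 2},
    {"PathChat": 2, "Model A": 1, "Model B": 2, "Model C": 3},
]
records = head_to_head(example)
for model, counts in sorted(records.items()):
    print(model, dict(counts))
```

Keeping one `Counter` per opposing model matches the pairwise framing of the caption: each question contributes exactly one outcome to each PathChat-versus-model record.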