PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Dan Wan, Xiaoxiao Lan, Mengyue Zheng, Jingxiong Li, Xinheng Lyu, Tao Lin, Lin Yang

TL;DR

Advanced LMMs struggle with the challenging PathMMU benchmark: the top-performing LMM, GPT-4V, achieves only 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists.

Abstract

The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of a specialized, high-quality benchmark has impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for Large Multimodal Models (LMMs). It comprises 33,428 multimodal multi-choice questions, each accompanied by an explanation for the correct answer, and 24,067 images from various sources. The construction of PathMMU harnesses GPT-4V's advanced capabilities, using over 30,000 image-caption pairs to enrich captions and generate corresponding Q&As in a cascading process. To maximize PathMMU's authority, we invite seven pathologists to scrutinize each question in PathMMU's validation and test sets under strict standards, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-sourced and 4 closed-sourced LMMs and of their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark, with the top-performing LMM, GPT-4V, achieving only 49.8% zero-shot performance, significantly lower than the 71.8% demonstrated by human pathologists. After fine-tuning, significantly smaller open-sourced LMMs can outperform GPT-4V but still fall short of the expertise shown by pathologists. We hope that PathMMU will offer valuable insights and foster the development of more specialized, next-generation LMMs for pathology.
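As a rough illustration of how zero-shot results on multi-choice questions of this kind could be scored, the sketch below computes the percentage of questions answered correctly and the gap between the two figures quoted in the abstract. The `ScoredItem` schema, the `accuracy` helper, and the demo items are hypothetical; they are not PathMMU's data format or the authors' evaluation code, and the abstract does not state that the reported numbers are computed exactly this way.

```python
from dataclasses import dataclass


@dataclass
class ScoredItem:
    """One multi-choice question with its gold answer and a model prediction.

    Hypothetical schema for illustration only.
    """
    question_id: str
    answer: str      # gold option letter, e.g. "B"
    prediction: str  # option letter returned by the model


def accuracy(items):
    """Return the percentage of items whose prediction matches the gold answer."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        1
        for item in items
        if item.prediction.strip().upper() == item.answer.strip().upper()
    )
    return 100.0 * correct / len(items)


# Made-up items, not taken from the benchmark.
demo = [
    ScoredItem("q1", "A", "A"),
    ScoredItem("q2", "C", "B"),
    ScoredItem("q3", "D", "d"),  # case-insensitive match counts as correct
    ScoredItem("q4", "B", "A"),
]
demo_accuracy = accuracy(demo)  # 2 of 4 correct

# Gap between the two figures reported in the abstract, in percentage points.
reported_gap = round(71.8 - 49.8, 1)
```

Under this reading, the reported human-versus-GPT-4V gap is 22.0 percentage points.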

Paper Structure (19 sections, 12 figures, 7 tables)

Figures (12)

  • Figure 1: An overview of the PathMMU benchmark. PathMMU is constructed from a diverse range of rich data sources. It comprises expert-level, multimodal, multi-choice questions in pathology, collaboratively crafted by AI and human pathology experts. Notably, even the most advanced LMMs substantially underperform human experts on PathMMU.
  • Figure 2: The comparison between PathMMU and existing benchmarks. The Q&A pairs in PathMMU are sourced extensively and comprehensively, undergoing rigorous multi-tiered filtering. This includes the initial filtering by multiple LLMs and the strict reviews by professional pathologists. Additionally, each question is accompanied by a detailed explanation. These attributes establish PathMMU as the most professionally curated, comprehensive, and highest-quality large-scale pathology dataset available.
  • Figure 3: An illustrative overview of the three main processes in PathMMU Q&A generation: data collection and preprocessing, detailed pathology image description generation, and question generation with LLMs filtering and expert validation.
  • Figure 4: Left: Illustration of corrupted pathology images. Right: LMM performance across various levels of color-related (brightness, hue, saturation) and image quality-related (pixelation, JPEG compression, bubble blur, motion blur, defocus blur) corruptions on the PathMMU test-tiny set, with level 0 representing the uncorrupted images.
  • Figure 5: Left: The performance comparison between different LLMs and human experts on 100 filtered samples where the answer can be guessed from text alone. Right: Results after expanding the sample size to 1,600 to validate the source of the LLMs' ability to guess answers, under two settings: (1) randomly replacing the original questions with others from the dataset while keeping the options unchanged; and (2) using BERT-series models for answer selection through Next Sentence Prediction (NSP), to assess whether an option is the sentence that follows a question.
  • ...and 7 more figures
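
The first control described in the Figure 5 caption, replacing each question with another drawn at random from the dataset while leaving the options untouched, can be sketched roughly as follows. The dictionary keys, the helper name `replace_questions`, and the sample items are assumptions made for illustration; the authors' actual procedure may differ in detail.

```python
import random


def replace_questions(items, seed=0):
    """Return copies of `items` in which each question text is taken from a
    different, randomly chosen item, while options and answer stay unchanged.

    Hypothetical helper, not the authors' code.
    """
    if len(items) < 2:
        raise ValueError("need at least two items to swap questions")
    rng = random.Random(seed)
    replaced = []
    for i, item in enumerate(items):
        # Draw an index from the other len(items) - 1 positions, skipping i.
        j = rng.randrange(len(items) - 1)
        if j >= i:
            j += 1
        swapped = dict(item)
        swapped["question"] = items[j]["question"]
        replaced.append(swapped)
    return replaced


# Made-up items, not taken from the benchmark.
demo_items = [
    {"question": "Question text one?", "options": ["A1", "B1", "C1"], "answer": "A"},
    {"question": "Question text two?", "options": ["A2", "B2", "C2"], "answer": "C"},
    {"question": "Question text three?", "options": ["A3", "B3", "C3"], "answer": "B"},
]
control_items = replace_questions(demo_items, seed=42)
```

If a text-only model still selects the labelled option for many items in `control_items`, that would suggest the options alone carry a cue to the answer, independent of the question.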