SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, Junjun He

TL;DR

SlideChat introduces a large vision-language assistant capable of understanding gigapixel whole-slide pathology images, addressing the limitations of patch-focused MLLMs. It combines a patch-level encoder, a slide-level encoder with sparse attention, and an LLM connected through a multimodal projector, and is trained in two stages: first to align the visual and textual modalities, then to learn visual instructions. The authors build SlideInstruction (4.2K WSI captions and 176K VQA pairs) and SlideBench (captioning and VQA benchmarks across TCGA and BCNB) to enable rigorous evaluation. SlideChat achieves state-of-the-art performance on 18 of 22 tasks and generalizes across domains, with results reported on both TCGA and BCNB. SlideChat, SlideInstruction, and SlideBench are released as open source, giving computational pathology a shared model, dataset, and benchmark for developing generalized, clinically grounded MLLMs. Overall, SlideChat bridges the gap between vision-language models and whole-slide pathology, enabling richer, context-aware analysis and potential clinical impact.
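
To make the data flow in the summary above concrete, here is a minimal sketch of how patch features might pass through a slide-level encoder and a projector before reaching the LLM. It is an illustration under stated assumptions, not the authors' implementation: the function names, the feature sizes, the random inputs, and the use of non-overlapping local windows as the form of sparse attention are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_window_attention(feats, window):
    """Self-attention restricted to non-overlapping windows of patches.

    Assumption: this is one simple kind of sparse attention, used here only
    as a stand-in for the slide-level encoder described in the summary.
    """
    n, d = feats.shape
    out = np.empty_like(feats)
    for start in range(0, n, window):
        block = feats[start:start + window]        # patches in this window
        scores = block @ block.T / np.sqrt(d)      # pairwise similarities
        out[start:start + window] = softmax(scores) @ block
    return out


def project(feats, weight, bias):
    """Linear multimodal projector into the LLM embedding space."""
    return feats @ weight + bias


rng = np.random.default_rng(0)
num_patches, patch_dim, llm_dim = 1000, 64, 128    # hypothetical sizes

# Stand-in for the output of the patch-level encoder (one row per patch).
patch_feats = rng.standard_normal((num_patches, patch_dim))

# Slide-level context, then projection to LLM-sized visual tokens.
slide_feats = local_window_attention(patch_feats, window=256)
weight = rng.standard_normal((patch_dim, llm_dim)) * 0.02
bias = np.zeros(llm_dim)
visual_tokens = project(slide_feats, weight, bias)

print(visual_tokens.shape)
```

Restricting attention to windows keeps the cost linear in the number of patches for a fixed window size, which is the usual motivation for sparse attention on very long patch sequences.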

Abstract

Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole-slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs across multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks. For example, it achieved an overall accuracy of 81.17% on SlideBench-VQA (TCGA) and 54.15% on SlideBench-VQA (BCNB). Our code, data, and model are publicly accessible at https://uni-medical.github.io/SlideChat.github.io.

Paper Structure

This paper contains 52 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 1: SlideChat is the first large vision-language assistant specifically designed for whole-slide pathology analysis. SlideChat can generate comprehensive descriptions of whole-slide images and provide contextually relevant responses across various applications.
  • Figure 2: Overview of SlideChat. (A) SlideChat serializes each input WSI into a sequence of 224×224 patches and converts each patch into a visual embedding with a patch-level encoder. A slide-level encoder then interacts with these features to generate contextual embeddings. Finally, a multimodal projector maps the visual features from the slide-level encoder into a unified space aligned with the LLM. (B) SlideChat is trained in two stages: Cross-Domain Alignment and Visual Instruction Learning.
  • Figure 3: (A) Overview of the SlideInstruction generation pipeline. We prompt GPT-4 to extract the WSI-Caption, Open-set VQA, and Closed-set VQA from reports. (B) For the generated Closed-set VQA, we employ LLMs to filter low-quality QA pairs and involve pathologists for validation, resulting in the creation of SlideBench-VQA. (C) Examples of WSI caption and instruction-following scenarios in microscopy, diagnostics, and clinical applications. For additional examples, please refer to the Supplementary Material.
  • Figure 4: Accuracy on different tasks in SlideBench-VQA (TCGA) (left) and SlideBench-VQA (BCNB) (right).
  • Figure 5: Interpretability and visualization. We identify the top five patch tokens with the highest attention scores associated with the output text responses.
  • ...and 4 more figures
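
Two steps named in the captions above can be sketched in a few lines: serializing a slide into 224×224 patches (Figure 2) and picking the five patch tokens with the highest attention scores (Figure 5). This is a hedged illustration only. The helper names, the slide dimensions, the choice to drop partial border patches, and the random attention scores are assumptions made for the example; the captions do not specify how borders are handled or how the scores are computed.

```python
import numpy as np

PATCH = 224  # patch side length stated in the Figure 2 caption


def tile_coords(height, width, patch=PATCH):
    """Top-left (row, col) of every full patch, in row-major order.

    Assumption: partial patches at the right and bottom borders are dropped.
    """
    return [
        (r, c)
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]


def top_k_patches(attention, k=5):
    """Indices of the k patches with the highest attention, best first."""
    return np.argsort(attention)[::-1][:k]


# Hypothetical slide region of 2240 x 4480 pixels: 10 rows x 20 columns.
coords = tile_coords(2240, 4480)
print(len(coords))  # 200

# Stand-in attention scores, one per patch token (random for illustration).
rng = np.random.default_rng(0)
attention = rng.random(len(coords))

top = top_k_patches(attention)
for idx in top:
    print(coords[idx], round(float(attention[idx]), 3))
```

Mapping the selected indices back through `coords` is what lets the highlighted tokens be drawn as regions on the original slide.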