WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, Lin Yang

TL;DR

The paper tackles the challenge of interpreting gigapixel whole-slide images (WSIs) by reframing slide-level pathology tasks as generative visual question answering. It introduces the Wsi2Text Transformer (W2T), a multimodal model that aligns patch-level visual embeddings with word-level question embeddings via co-attention to generate free-form answers, and it curates the first WSI-VQA dataset, with 977 WSIs and 8672 QA pairs (close-ended and open-ended). The results show that W2T performs competitively on histological subtyping, biomarker prediction, and survival tasks while offering interpretable co-attention heatmaps that highlight clinically relevant regions. The work contributes a unified framework and a public dataset that can underpin future multimodal large language models in computational pathology, with potential for clinical impact and broader applicability to other large-resolution modalities.

Abstract

Whole slide imaging is routinely adopted for carcinoma diagnosis and prognosis. Abundant experience is required for pathologists to achieve accurate and reliable diagnostic results of whole slide images (WSI). The huge size and heterogeneous features of WSIs make the workflow of pathological reading extremely time-consuming. In this paper, we propose a novel framework (WSI-VQA) to interpret WSIs by generative visual question answering. WSI-VQA shows universality by reframing various kinds of slide-level tasks in a question-answering pattern, in which pathologists can achieve immunohistochemical grading, survival prediction, and tumor subtyping following human-machine interaction. Furthermore, we establish a WSI-VQA dataset which contains 8672 slide-level question-answering pairs with 977 WSIs. Besides the ability to deal with different slide-level tasks, our generative model which is named Wsi2Text Transformer (W2T) outperforms existing discriminative models in medical correctness, which reveals the potential of our model to be applied in the clinical scenario. Additionally, we also visualize the co-attention mapping between word embeddings and WSIs as an intuitive explanation for diagnostic results. The dataset and related code are available at https://github.com/cpystan/WSI-VQA.
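
The reframing of slide-level labels as question-answering pairs can be illustrated with a minimal sketch. It assumes fixed question templates applied to a per-slide clinical record; the template wording, the field names (`subtype`, `her2`, `pr`), the `QAPair` type, and the slide identifier are all hypothetical and are not taken from the released dataset or code.

```python
from dataclasses import dataclass

# Hypothetical templates: the wording is illustrative, not the dataset's own.
TEMPLATES = {
    "subtype": "What is the histological subtype of this tumor?",
    "her2": "What is the Her-2 status of this slide?",
    "pr": "What is the PR status of this slide?",
}


@dataclass
class QAPair:
    slide_id: str
    question: str
    answer: str


def make_qa_pairs(slide_id, clinical_index):
    """Turn each known, non-empty clinical field into one QA pair."""
    pairs = []
    for field, question in TEMPLATES.items():
        value = clinical_index.get(field)
        # Skip fields that are missing or blank: no label, no question.
        if value is None or str(value).strip() == "":
            continue
        pairs.append(QAPair(slide_id, question, str(value).strip().lower()))
    return pairs


# Assumed example record; "slide-001" is a made-up identifier.
pairs = make_qa_pairs(
    "slide-001", {"subtype": "Ductal", "her2": "", "pr": "Positive"}
)
for pair in pairs:
    print(pair.question, "->", pair.answer)
```

Because every task becomes a (question, answer) text pair, a single generative model can in principle be trained on subtyping, biomarker, and survival questions together, rather than one classifier head per task.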

Paper Structure (23 sections, 3 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Flow diagram of the WSI-VQA construction. We first select pathological captions and clinical indexes in TCGA. Then, close-ended and open-ended VQA pairs are obtained through GPT and fixed templates respectively. Finally, clinical validation is adopted to remove the pairs which are flawed or cannot be inferred from the WSI.
  • Figure 2: (a) The distribution of biological entities in our proposed dataset. The center circle shows the distribution of close-ended pairs and open-ended pairs. The outer circle presents the entities in the dataset from which we can see that our questions cover a wide range of slide-level characteristics. Several examples of questions are also demonstrated. (b) The frequency of different categories of questions set by the first word. The 'what' questions dominate the VQA pairs with a frequency of $79.9\%$. (c) The ratio of various entities in the open-ended subset. 'Her-2' and 'PR' are two biomarkers for breast cancer.
  • Figure 3: Structure of our proposed VQA model. First, the WSI and the question are processed and tokenized for subsequent stages. Then, various visual and text extractors are applied to extract embeddings given visual and word tokens. $T_e$ stands for the transformer encoder and $T_l$ represents the linear mapping, aiming to achieve alignment between visual and word embeddings. The interaction mechanism between different modalities, which is instantiated as the co-attention in the transformer decoder $T_d$, is adopted to generate the answer.
  • Figure 4: Several examples of the WSIs and their corresponding VQA pairs. The pairs in the grey rectangle are from the close-ended subset which has multiple choices while the ones in the blue rectangle are from the open-ended subset. The choice which is underlined is the right answer. These questions are all challenging because they require sufficient medical knowledge and understanding of complex characteristics in the gigapixel WSIs.
  • Figure 5: Co-attention visualization with corresponding questions. The keyword in the question, which is marked in a red rectangle, guides the interaction and aggregation of image patches using co-attention weights. The heatmap reflects how much each patch in the WSI is relevant to the word token ranging from blue to red.
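
The co-attention heatmap described for Figures 3 and 5 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a single attention head, no learned query/key projections, and toy two-dimensional embeddings, whereas the paper places the co-attention inside a transformer decoder. The function names and the example vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention_weights(word_embedding, patch_embeddings):
    """Scaled dot-product attention of one word token over all patches."""
    d = len(word_embedding)
    scores = [
        sum(w * p for w, p in zip(word_embedding, patch)) / math.sqrt(d)
        for patch in patch_embeddings
    ]
    return softmax(scores)


def to_heatmap(weights):
    """Min-max normalise weights to [0, 1]; 0 maps to blue, 1 to red."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]


# Toy data (assumed): one keyword embedding and three patch embeddings.
word = [1.0, 0.0]
patches = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]

weights = co_attention_weights(word, patches)
heat = to_heatmap(weights)
print(heat)
```

In this toy case the first patch is most aligned with the keyword and receives the highest heat value, while the third patch, which points in the opposite direction, receives the lowest. In a real pipeline each heat value would be painted back onto the patch's location in the slide.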