Efficient Whole Slide Pathology VQA via Token Compression

Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen

TL;DR

This work tackles visual question answering on gigapixel whole-slide images by introducing TCP-LLaVA, a multimodal LLM that compresses thousands of patch and text tokens into a small set of trainable compression tokens via a Modality Compression Module. Only these fixed-length compressed tokens are passed to the LLM, enabling end-to-end VQA with dramatically reduced input length and computational cost while retaining diagnostic reasoning. The approach achieves state-of-the-art accuracy on a TCGA-based multi-tumor QA benchmark (average $78.57\%$) and delivers substantial efficiency gains (input reduction >$99\%$, plus TFLOPS and throughput improvements). This token-compression paradigm makes scalable WSI VQA feasible on standard hardware and opens avenues for extension to generative pathology tasks.

Abstract

Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs) due to long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.
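
The compression idea described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' Modality Compression Module: it assumes a single attention head, omits learned query/key/value projections, residual connections, and layer stacking, and every size below (`DIM`, `NUM_VISUAL`, `NUM_TEXT`, `NUM_COMPRESSION`) is a hypothetical value chosen for illustration. It only shows how a fixed number of query tokens can attend over a long visual-plus-text sequence so that the downstream model sees a short, fixed-length input.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(compression_tokens, visual_tokens, text_tokens):
    """Single-head cross-attention: each compression token (query) attends
    over all visual and text tokens (used as both keys and values).

    Illustrative simplification only: no learned projections, no residual
    connection, no multi-layer stacking.
    """
    context = visual_tokens + text_tokens
    dim = len(compression_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in compression_tokens:
        # Scaled dot-product score between this query and every context token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in context]
        weights = softmax(scores)
        # Attention-weighted average of the context tokens.
        compressed.append(
            [sum(w * value[d] for w, value in zip(weights, context)) for d in range(dim)]
        )
    return compressed


def random_tokens(count, dim, rng):
    """Stand-in for real embeddings: Gaussian noise vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8               # hypothetical embedding width
NUM_VISUAL = 2000     # hypothetical number of patch tokens
NUM_TEXT = 40         # hypothetical number of question and answer-choice tokens
NUM_COMPRESSION = 16  # hypothetical number of compression tokens

visual = random_tokens(NUM_VISUAL, DIM, rng)
text = random_tokens(NUM_TEXT, DIM, rng)
queries = random_tokens(NUM_COMPRESSION, DIM, rng)

compressed = compress(queries, visual, text)
reduction = 1.0 - len(compressed) / (NUM_VISUAL + NUM_TEXT)
print(len(compressed), len(compressed[0]))
print(f"input length reduction: {reduction:.2%}")
```

In this sketch the output length depends only on `NUM_COMPRESSION`, not on how many patches the slide produces, which is the property that keeps the LLM input short regardless of slide size.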

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of different WSI modeling paradigms for pathology tasks. (a) MIL-based methods aggregate patch-level features into a slide-level representation via aggregation modules, typically used for classification. (b) CLIP-based methods compute similarity scores between each patch and text prompt embeddings, followed by MIL aggregation for classification tasks. (c) MLLM-based methods directly feed all extracted visual tokens (often exceeding 10K) into a large language model (LLM) along with text tokens for end-to-end answer generation. (d) Our TCP-LLaVA introduces a modality compression module that distills thousands of patches and text tokens into a compact set of trainable compression tokens. This architecture significantly reduces computational load while maintaining high performance on gigapixel-scale VQA tasks.
  • Figure 2: Overall architecture of TCP-LLaVA for visual question answering on whole slide images (WSIs). Given a high-resolution WSI, non-overlapping patches are extracted and passed through a pretrained visual encoder, followed by a projector that aligns visual features to the LLM embedding space, producing visual tokens. Meanwhile, the question and answer choices are tokenized into text tokens. These, along with a set of special trainable compression tokens, are input to the Modality Compression Module, which performs cross-modal attention to distill a compact representation. Only the updated compressed tokens are forwarded to the large language model (LLM) for answer generation. This design enables efficient and scalable reasoning on gigapixel WSIs while significantly reducing input sequence length and computational cost.
  • Figure 3: Visualization of the tumor-type distribution in our curated multi-tumor VQA benchmark dataset, constructed from TCGA [TCGA_GDC] and refined with annotations from SlideBench [chen2025slidechat]. Each colored segment represents one of ten tumor types, with the arc length proportional to the number of associated question-answer (QA) pairs. The dataset includes BRCA (37,564), LGG (30,074), COAD (14,481), HNSC (13,615), GBM (21,009), LUAD (17,969), LUSC (16,438), BLCA (15,294), SKCM (4,066), and READ (5,287). For each tumor type, we illustrate a representative whole slide image (WSI) and a corresponding clinical question-answer pair, highlighting the dataset’s diversity and the depth of pathology-informed reasoning.
  • Figure 4: Radar chart of the performance comparison across ten tumor types on the TCGA benchmark. The accuracy of TCP-LLaVA is illustrated for each tumor type.
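
As a quick check on the class balance implied by the Figure 3 caption, the short script below tallies the per-tumor QA counts listed there and prints each tumor type's share. The counts are copied from the caption; the total and the percentages are derived here and are not stated in the source. The variable names are illustrative.

```python
# QA-pair counts per tumor type, as listed in the Figure 3 caption.
qa_pairs = {
    "BRCA": 37564,
    "LGG": 30074,
    "COAD": 14481,
    "HNSC": 13615,
    "GBM": 21009,
    "LUAD": 17969,
    "LUSC": 16438,
    "BLCA": 15294,
    "SKCM": 4066,
    "READ": 5287,
}

total = sum(qa_pairs.values())
shares = {tumor: count / total for tumor, count in qa_pairs.items()}

# Print tumor types from largest to smallest share of the benchmark.
for tumor, count in sorted(qa_pairs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tumor:5s} {count:7d} {shares[tumor]:6.1%}")
print(f"total {total:7d}")
```

The listed counts sum to 175,797 QA pairs, with BRCA the largest group and SKCM the smallest, so a single averaged accuracy figure weights tumor types unevenly unless it is macro-averaged.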