Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wenchao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, Chen Wang

TL;DR

The paper proposes HistoSelect, a question-guided, tissue-aware, coarse-to-fine retrieval framework that outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. The results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.

Abstract

Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs.
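
The coarse-to-fine idea in the abstract (first pick question-relevant tissue regions, then pick the most informative patches inside them) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity scoring, the mean-similarity group score, the function name `coarse_to_fine_select`, and the toy features and tissue labels are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_to_fine_select(patch_feats, tissue_ids, question_feat,
                          top_groups=1, top_k=2):
    """Hypothetical two-stage selector: tissue groups first, then patches."""
    sims = [cosine(p, question_feat) for p in patch_feats]

    # Coarse step (assumed scoring rule): rank tissue groups by the mean
    # question-patch similarity of their member patches.
    members = defaultdict(list)
    for idx, group in enumerate(tissue_ids):
        members[group].append(idx)
    group_score = {g: sum(sims[i] for i in idxs) / len(idxs)
                   for g, idxs in members.items()}
    kept_groups = sorted(group_score, key=group_score.get,
                         reverse=True)[:top_groups]

    # Fine step: keep the top-k most similar patches inside the kept groups.
    candidates = [i for g in kept_groups for i in members[g]]
    selected = sorted(candidates, key=lambda i: sims[i], reverse=True)[:top_k]
    return sorted(selected), kept_groups


# Toy data: six 3-d "patch embeddings" with made-up tissue labels.
patch_feats = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
    [0.1, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.5, 0.5, 0.0],
]
tissue_ids = ["tumor", "tumor", "stroma", "stroma", "lymphocyte", "tumor"]
question_feat = [1.0, 0.0, 0.0]

selected, kept_groups = coarse_to_fine_select(patch_feats, tissue_ids,
                                              question_feat)
print(kept_groups, selected)
```

In this toy case the selector keeps two of six patches. The numbers are only an artifact of the made-up inputs and say nothing about the token savings reported in the paper.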

Paper Structure (22 sections, 12 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Illustration of our HistoSelect framework. (a) The baseline method feeds a large number of patches indiscriminately into the VLM, leading to high redundancy and question-irrelevance. (b) Our question-guided tissue-aware selection method. The question guides the model to select a relevant and sparse subset of informative patches, which are then fed to the VLM for reasoning.
  • Figure 2: Visualization and quantitative pre-analysis of patch relevance for a VQA sample (from TCGA-BRCA). (a) Reference WSI. (b) Tissue segmentation, with the tumor region shown in red. (c) Patch-level relevance heatmap based on question-patch similarity. High-relevance regions (lighter areas) align with the tumor region from (b). (d) F1 score comparison for retrieving tumor patches using different sampling methods. The question-guided sampling strategy (red) vastly outperforms question-agnostic methods such as diversity sampling (DivPrune; Alvar et al., 2025) (blue) and random sampling (gray), demonstrating the limited efficacy of non-guided selection.
  • Figure 3: Overview of HistoSelect. The framework operates in two stages: Tissue Segmentation partitions the WSI into $M$ tissue types (e.g., tumor, stromal, lymphocyte) using pathologist-designed prompts. The Hierarchical Selector then uses the question feature to dynamically select the top $K$ most relevant patch tokens, which are subsequently passed to the LLM for multi-modal answer generation.
  • Figure 4: Visualization of the tissue segmentation and selection process for tumor patches. (a) Original WSI. (b) Tissue segmentation mask. (c) Visualization of patches before selection (a randomly selected subset is shown for clarity). (d) Visualization of patches after selection. Compared to (c), the patches selected by our model in (d) contain far fewer non-tumor patches, demonstrating an improved focus on informative tumor-related regions.
  • Figure 5: The user interface for the Tissue Segmentation Survey. The central area shows a side-by-side comparison of the original WSI and the tissue segmentation result. A detailed legend on the right clarifies the tissue classes, and the bottom section allows pathologists to submit their ratings.
  • ...and 3 more figures
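
The comparison described in Figure 2(d), which scores a sampling strategy by the F1 of the tumor patches it retrieves, can be sketched as follows. The relevance scores, the set of tumor patch indices, and the helper names `f1_score` and `top_k` are hypothetical; the toy data is constructed so that the question-guided pick is perfect, which illustrates the metric rather than reproducing the paper's result.

```python
import random


def f1_score(selected, relevant):
    """F1 of a retrieved patch set against the ground-truth relevant set."""
    selected, relevant = set(selected), set(relevant)
    true_pos = len(selected & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(selected)
    recall = true_pos / len(relevant)
    return 2 * precision * recall / (precision + recall)


def top_k(scores, k):
    """Indices of the k highest scores (question-guided selection)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical question-patch relevance scores for eight patches,
# and a made-up ground-truth set of tumor patch indices.
relevance = [0.91, 0.12, 0.85, 0.30, 0.05, 0.78, 0.22, 0.10]
tumor_patches = {0, 2, 5}
k = 3

guided = top_k(relevance, k)
random_pick = random.Random(0).sample(range(len(relevance)), k)

guided_f1 = f1_score(guided, tumor_patches)
random_f1 = f1_score(random_pick, tumor_patches)
print(f"question-guided F1: {guided_f1:.2f}, random F1: {random_f1:.2f}")
```

A diversity-based baseline could be scored with the same `f1_score` function by passing in whichever patch indices that method selects.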