VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation

Anh Tien Nguyen, Keunho Byeon, Kyungeun Kim, Jin Tae Kwak

TL;DR

VLEER addresses a gap in whole slide image (WSI) analysis: pre-trained vision–language models have mostly been applied at the patch level rather than to whole slides. It uses such models to create explainable WSI representations. It introduces a task-specific pathology text pool and a clustering-based alignment of image patches with keywords to generate vision–language embeddings, augmented with cluster-level textual cues. The approach provides region-level explanations via ReVL annotations and MIL-attention heatmaps. It is demonstrated on three TCGA datasets with multiple MIL aggregators, showing both quantitative gains and interpretable qualitative insights. By linking predictive signals to human-readable pathology terms and spatial regions, the work advances clinically relevant explainability in computational pathology and may improve trust and adoption in diagnostic workflows.

Abstract

Recent advances in vision-language models (VLMs) have shown remarkable potential in bridging visual and textual modalities. In computational pathology, domain-specific VLMs, which are pre-trained on extensive histopathology image-text datasets, have succeeded in various downstream tasks. However, existing research has primarily focused on the pre-training process and on direct applications of VLMs at the patch level, leaving their potential for whole slide image (WSI) applications unexplored. In this study, we hypothesize that pre-trained VLMs inherently capture informative and interpretable WSI representations through quantitative feature extraction. To validate this hypothesis, we introduce Vision and Language Embeddings for Explainable WSI Representation (VLEER), a novel method designed to leverage VLMs for WSI representation. We systematically evaluate VLEER on three pathological WSI datasets, demonstrating better performance in WSI analysis than conventional vision features. More importantly, VLEER offers interpretability: it uses the textual modality to produce detailed pathology annotations, giving direct, human-readable insight into the results and clear reasoning for WSI-level downstream pathology tasks.

Paper Structure

This paper contains 13 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Vision-language embedding generation in VLEER. Tiled patches and curated keywords are embedded using a pre-trained VLM’s vision and text encoders. The vision embeddings are clustered, and the similarities between the text embeddings and the clustered vision embeddings are then measured to select the top-K keywords. These keywords are combined and used to obtain cluster-level language embeddings, which are then concatenated with the corresponding vision embeddings, forming vision-language embeddings.
  • Figure 2: The ReVL annotations and attention heatmap of a papillary renal cell carcinoma in TCGA-RCC. The highly attended regions (red) in the heatmap are closely related to the patterns of papillary cancer, whereas the weakly attended regions (green and blue) correspond to normal histology of renal tissues.
  • Figure 3: The ReVL annotations and attention heatmap of a lung squamous cell carcinoma in TCGA-NSCLC. The highly attended regions (red) in the heatmap are closely related to the patterns of squamous cancer, whereas the weakly attended regions (green and blue) correspond to normal histology of lung tissues.
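
The embedding generation summarized in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the patch and keyword embeddings are taken as precomputed arrays (random stand-ins here, in place of a pre-trained VLM's encoders), clustering is a plain NumPy k-means, similarity is cosine similarity against cluster centroids, and the top-K keywords are "combined" by averaging their text embeddings. The caption does not specify any of these choices, and the names `kmeans`, `vision_language_embeddings`, `n_clusters`, and `top_k` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def assign(x, centroids):
    """Index of the nearest centroid (squared Euclidean) for every row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans(x, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed clustering choice)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for c in range(n_clusters):
            members = x[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return assign(x, centroids), centroids


def vision_language_embeddings(patch_emb, keyword_emb, n_clusters=4, top_k=3):
    """Cluster patches, pick top-K keywords per cluster, concatenate embeddings."""
    patch_emb = l2_normalize(patch_emb)
    keyword_emb = l2_normalize(keyword_emb)
    labels, centroids = kmeans(patch_emb, n_clusters)

    # Cosine similarity between each cluster centroid and each keyword.
    sims = l2_normalize(centroids) @ keyword_emb.T  # (n_clusters, n_keywords)
    top_idx = np.argsort(-sims, axis=1)[:, :top_k]  # (n_clusters, top_k)

    # ASSUMPTION: "combining" keywords = averaging their text embeddings.
    cluster_text = l2_normalize(keyword_emb[top_idx].mean(axis=1))

    # Each patch gets its own vision embedding plus its cluster's text embedding.
    vl_emb = np.concatenate([patch_emb, cluster_text[labels]], axis=1)
    return vl_emb, labels, top_idx


# Random stand-ins for VLM outputs: 200 patches, 10 keywords, 16-dim embeddings.
rng = np.random.default_rng(42)
patches = rng.normal(size=(200, 16))
keywords = rng.normal(size=(10, 16))
vl_emb, labels, top_idx = vision_language_embeddings(patches, keywords)
print(vl_emb.shape)  # (200, 32)
```

In this sketch `top_idx[c]` holds the indices of the keywords selected for cluster `c`, so mapping `labels` back to patch coordinates would give a region-level keyword annotation. Whether that corresponds to how ReVL annotations are produced is not stated in this section.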