VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation
Anh Tien Nguyen, Keunho Byeon, Kyungeun Kim, Jin Tae Kwak
TL;DR
VLEER addresses a gap in whole slide image (WSI) analysis: pre-trained vision-language models have rarely been applied beyond the patch level. It uses such models to build explainable WSI representations. The method introduces a task-specific pathology text pool and a clustering-based alignment of image patches with keywords to generate vision-language embeddings, which are augmented with cluster-level textual cues. It provides region-level explanations through ReVL annotations and multiple instance learning (MIL) attention heatmaps. Experiments on three TCGA datasets with multiple MIL aggregators show both quantitative gains and interpretable qualitative insights. By linking predictive signals to human-readable pathology terms and spatial regions, the work advances clinically relevant explainability in computational pathology and may improve trust and adoption in diagnostic workflows.
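The clustering-based alignment summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that patch and keyword embeddings already live in a shared vision-language space, substitutes a plain k-means for whatever clustering the method actually uses, and assumes that the cluster-level textual cue is the mean embedding of each cluster's top-matching keywords, concatenated to every patch embedding in that cluster. The function names, the keyword list, and the parameters `k` and `top_m` are hypothetical, and random vectors stand in for real encoder outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def kmeans(x, k, n_iter=20, seed=0):
    """Plain k-means (assumed stand-in for the paper's clustering step)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every patch to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def cluster_text_alignment(patch_emb, text_emb, keywords, k=4, top_m=2):
    """Cluster patches, match each cluster to keywords, build joint embeddings."""
    patch_emb = l2_normalize(patch_emb)
    text_emb = l2_normalize(text_emb)
    labels = kmeans(patch_emb, k)
    sim = patch_emb @ text_emb.T  # (n_patches, n_keywords) cosine similarities

    vl_emb = np.zeros((len(patch_emb), patch_emb.shape[1] + text_emb.shape[1]))
    cluster_terms = {}
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        # Rank keywords by their mean similarity to the cluster's patches.
        top = np.argsort(sim[idx].mean(axis=0))[::-1][:top_m]
        cluster_terms[j] = [keywords[t] for t in top]
        # Assumed textual cue: mean embedding of the top-matching keywords.
        cue = text_emb[top].mean(axis=0)
        vl_emb[idx] = np.concatenate(
            [patch_emb[idx], np.tile(cue, (idx.size, 1))], axis=1
        )
    return vl_emb, labels, cluster_terms


# Toy usage: random vectors in place of real VLM image/text encoder outputs.
rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 32))   # 200 patches, 32-dim embeddings
texts = rng.normal(size=(6, 32))       # 6 keywords, same embedding space
keywords = ["tumor", "stroma", "necrosis",
            "lymphocytes", "adipose", "normal epithelium"]  # hypothetical pool
vl, labels, terms = cluster_text_alignment(patches, texts, keywords)
```

In this sketch, `vl` could be passed to any MIL aggregator in place of vision-only patch features, while `terms` maps each cluster to human-readable keywords that could label the corresponding slide regions.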
Abstract
Recent advances in vision-language models (VLMs) have shown remarkable potential in bridging visual and textual modalities. In computational pathology, domain-specific VLMs pre-trained on extensive histopathology image-text datasets have succeeded in various downstream tasks. However, existing research has focused primarily on the pre-training process and on direct applications of VLMs at the patch level, leaving their potential for whole slide image (WSI) applications largely unexplored. In this study, we hypothesize that pre-trained VLMs inherently capture informative and interpretable WSI representations through quantitative feature extraction. To validate this hypothesis, we introduce Vision and Language Embeddings for Explainable WSI Representation (VLEER), a novel method designed to leverage VLMs for WSI representation. We systematically evaluate VLEER on three pathological WSI datasets and demonstrate superior performance in WSI analysis over conventional vision features. More importantly, VLEER offers the unique advantage of interpretability: by leveraging the textual modality for detailed pathology annotations, it yields direct, human-readable insights into the results and provides clear reasoning for WSI-level downstream pathology tasks.
