
Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction

Hao Li, Ying Chen, Yifei Chen, Wenxian Yang, Bowen Ding, Yuchen Han, Liansheng Wang, Rongshan Yu

TL;DR

This paper tackles poor generalization in WSI classification caused by coarse-grained text supervision by introducing FiVE, a Fine-grained Visual-Semantic Interaction framework. FiVE takes non-standardized WSI-report pairs, uses GPT-4 to standardize the reports into fine-grained labels, and then applies a Task-specific Fine-grained Semantics (TFS) module with learnable diagnosis prompts to guide visual-semantic alignment; a patch sampling strategy keeps training efficient. The model uses a frozen image encoder and a text encoder (BioClinicalBERT with LoRA) to compute bag-level embeddings, which are aligned with the text through symmetric contrastive losses. It shows strong zero-shot and few-shot transfer on TCGA-Lung and Camelyon16, outperforming state-of-the-art MIL-based and VLM-based methods. In few-shot experiments on TCGA-Lung, FiVE achieves at least 9.19% higher accuracy than competing methods, and it is competitive in zero-shot histological subtype classification, a result attributed to the fine-grained semantics and prompt-driven instance aggregation. The reduced computational cost makes the approach more practical for computational pathology and suggests it could be adapted to other high-resolution medical imaging tasks.
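The symmetric contrastive alignment mentioned above can be illustrated with a minimal, dependency-free sketch. It assumes a CLIP-style formulation (cosine-similarity logits scored in both the image-to-text and text-to-image directions, then averaged); the function names and the temperature of 0.07 are illustrative assumptions, not values taken from the paper or its repository.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct column for row i is column i,
    # i.e. the i-th slide is paired with the i-th description.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # temperature=0.07 is an assumed default, not a value from the paper.
    img = [l2_normalize(v) for v in image_embs]
    txt = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every bag-level embedding and every text embedding.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

if __name__ == "__main__":
    bags = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[0.9, 0.1], [0.2, 0.8]]
    print(symmetric_contrastive_loss(bags, texts))
```

Averaging the two directions means that neither modality is treated as the sole anchor: a slide must pick out its own description, and a description must pick out its own slide.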

Abstract

Whole Slide Image (WSI) classification is often formulated as a Multiple Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have demonstrated remarkable performance in WSI classification. However, existing methods leverage coarse-grained pathogenetic descriptions for visual representation supervision, which are insufficient to capture the complex visual appearance of pathogenetic images, hindering the generalizability of models on diverse downstream tasks. Additionally, processing high-resolution WSIs can be computationally expensive. In this paper, we propose a novel "Fine-grained Visual-Semantic Interaction" (FiVE) framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics. Specifically, with meticulously designed queries, we start by utilizing a large language model to extract fine-grained pathological descriptions from various non-standardized raw reports. The output descriptions are then reconstructed into fine-grained labels used for training. By introducing a Task-specific Fine-grained Semantics (TFS) module, we enable prompts to capture crucial visual information in WSIs, which enhances representation learning and augments generalization capabilities significantly. Furthermore, given that pathological visual patterns are redundantly distributed across tissue slices, we sample a subset of visual instances during training. Our method demonstrates robust generalizability and strong transferability, substantially outperforming its counterparts on the TCGA Lung Cancer dataset with at least 9.19% higher accuracy in few-shot experiments. The code is available at: https://github.com/ls1rius/WSI_FiVE.
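The instance-sampling idea in the abstract can be sketched as follows. This is a minimal illustration that assumes uniform random sampling without replacement; the paper compares several sampling strategies (see Figure 3), and the function name `sample_instances` and its parameters are hypothetical rather than taken from the released code.

```python
import random

def sample_instances(instance_embeddings, num_samples, rng=None):
    # Keep a random subset of a slide's patch embeddings for one training step.
    # Uniform sampling without replacement is an assumption made for this sketch.
    rng = rng or random.Random()
    n = len(instance_embeddings)
    if n <= num_samples:
        # Small bags are used in full.
        return list(instance_embeddings)
    # Sort the chosen indices so the subset keeps the original patch order.
    chosen = sorted(rng.sample(range(n), num_samples))
    return [instance_embeddings[i] for i in chosen]

if __name__ == "__main__":
    # Toy bag: 1000 one-dimensional "embeddings" standing in for patch features.
    bag = [[float(i)] for i in range(1000)]
    subset = sample_instances(bag, 64, random.Random(0))
    print(len(subset))
```

Because diagnostic patterns tend to recur across a slide, a fresh random subset at each step can still expose the model to the relevant tissue while the per-step cost depends on `num_samples` rather than on the full bag size.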

Paper Structure

This paper contains 32 sections, 7 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Challenges in WSI-text contrastive learning. Most conventional VLM approaches categorize whole slide images using category-level text descriptions and overlook intra-class differences, which degrades performance and limits generalization. Instead, we extract fine-grained descriptions from pathology reports as slide-level labels to develop our model, capturing the detailed variations in each WSI.
  • Figure 2: Left: The structure of the FiVE framework. The model consists of a frozen image encoder, a text encoder, and the TFS module. Whole slide images are divided into instances for embedding extraction by the image encoder. Raw pathology reports are standardized by GPT-4 into fine-grained descriptions. The fine-grained descriptions and manual prompts are sampled, shuffled, and reconstructed in pairs. These prompts aggregate instances into bag-level features, which are then aligned with the descriptions using a contrastive loss. Top Right: Fine-grained pathological descriptions. The fine-grained pathological descriptions are generated from multiple answers based on specific queries. These descriptions undergo random sampling, shuffling, and reconstruction to form a unified sentence. Bottom Right: The Instance Aggregator module. The instance aggregator consists of a self-attention module and a cross-attention module, which fuse image instance embeddings and prompt embeddings to create bag-level features.
  • Figure 3: Classification performance on TCGA Lung Cancer with diverse sampling strategies, presenting average and standard deviation (std) ACC values.
  • Figure 4: Pathology report examples. We randomly sample several pathology reports with different reporting standards for display. Sensitive information has been masked.
  • Figure 5: Fine-grained guidance construction pipeline.
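
The sampling, shuffling, and reconstruction of fine-grained descriptions described in the Figure 2 caption can be sketched as below. This is only an illustration under stated assumptions: the function name `reconstruct_description`, the `keep` parameter, and the example answers are hypothetical placeholders, not the authors' implementation or data.

```python
import random

def reconstruct_description(answers, keep, rng=None):
    # Build one training sentence from per-query answers:
    # drop empty answers, sample a subset, shuffle it, and join the pieces.
    rng = rng or random.Random()
    usable = [a.strip().rstrip(".") for a in answers if a and a.strip()]
    k = min(keep, len(usable))
    chosen = rng.sample(usable, k)  # random subset of the answers
    rng.shuffle(chosen)             # explicit shuffle, mirroring the caption
    if not chosen:
        return ""
    return ". ".join(chosen) + "."

if __name__ == "__main__":
    # Placeholder answers, invented for illustration only.
    answers = [
        "Tumor cells form glandular structures.",
        "Nuclei are enlarged",
        "",
        "Necrosis is present.",
    ]
    print(reconstruct_description(answers, keep=2, rng=random.Random(0)))
```

Varying which answers appear, and in what order, yields a different text target for the same slide on each draw, which is one plausible reading of why the caption lists sampling and shuffling as separate steps.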