Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology

Devavrat Tomar, Guillaume Vray, Dwarikanath Mahapatra, Sudipta Roy, Jean-Philippe Thiran, Behzad Bozorgtabar

TL;DR

The paper tackles few-shot whole slide image (WSI) classification in histopathology, where gigapixel WSIs and limited annotations hinder traditional supervised learning. It introduces SLIP, a slide-level prompt learning framework that combines vision-language models with language-model-derived tissue priors to identify informative patches and align them with WSI class descriptors. Key contributions include dual similarity pooling (between patches and tissue prompts, and between tissue descriptors and WSI classes) and learnable slide-level prompts optimized with a supervised InfoNCE loss using only a few labeled WSIs. Evaluations on the DHMC and PatchGastric datasets show that SLIP outperforms state-of-the-art MIL and VLM baselines, while tissue-focused patch attribution improves interpretability, making it a data-efficient and potentially more generalizable approach to histopathology WSI analysis.
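
To make the "supervised InfoNCE loss" mentioned above concrete, here is a minimal NumPy sketch of one common form of such a loss: a softmax cross-entropy over cosine similarities between class-specific slide features and class text embeddings. The function name, the array shapes, and the temperature value of 0.07 are assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def supervised_info_nce(slide_feats, text_embeds, label, temperature=0.07):
    """Illustrative supervised InfoNCE-style loss for a single slide.

    slide_feats: (C, D) array, one slide-level feature per candidate class
                 (assumed layout, not taken from the paper).
    text_embeds: (C, D) array, one text embedding per candidate class.
    label:       index of the slide's true class.
    """
    # Cosine similarity between each class feature and its class prompt.
    f = slide_feats / np.linalg.norm(slide_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = np.sum(f * t, axis=1) / temperature  # shape (C,)

    # Numerically stable log-softmax; the true class is the positive,
    # all other classes act as negatives.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]
```

The loss is small when the feature pooled for the true class matches its prompt more closely than the other classes match theirs, and it grows as the true class falls behind.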

Abstract

In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method's superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.
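
The two baseline families contrasted in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names, the value of `k`, and the use of cosine similarity are assumptions, and it is not the implementation in the SLIP repository.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def average_pool(patch_feats):
    """Conventional MIL aggregation: mean of patch features, (N, D) -> (D,)."""
    return patch_feats.mean(axis=0)

def top_k_class_scores(patch_feats, class_text_embeds, k=5):
    """VLM-style slide scores from patch-to-prompt similarity.

    patch_feats:       (N, D) patch embeddings from the image encoder.
    class_text_embeds: (C, D) embeddings of the candidate class prompts.
    Returns a (C,) array: for each class, the mean of its k highest
    patch similarities.
    """
    sims = l2_normalize(patch_feats) @ l2_normalize(class_text_embeds).T  # (N, C)
    k = min(k, sims.shape[0])
    top_k = np.sort(sims, axis=0)[-k:]  # k largest similarities per class
    return top_k.mean(axis=0)
```

Average pooling yields a feature that still needs a classifier trained on bag-level labels, whereas the top-K scores can be read off directly as class predictions (`scores.argmax()`), which is why the latter suits few-shot settings but depends entirely on how well the class prompts describe individual patches.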

Paper Structure

This paper contains 5 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the proposed method vs. existing MIL- and VLM-based approaches for few-shot WSI classification. (a) Conventional MIL methods use pooling functions such as Average Pooling to obtain slide-level features; (b) VLM-based methods measure similarity between patches and WSI text prompts, using, e.g., Top-K Pooling; (c) Our SLIP framework introduces SLIP pooling, which computes the similarity $\mathbf{S}_\text{tissue}^\text{patch}$ between patch features and tissue-specific text embeddings from ChatGPT, and the similarity $\mathbf{S}_\text{tissue}^\text{wsi}$ between whole-slide class names and tissue type names, and aggregates class-specific features as $\mathbf{F}_\text{wsi}$.
  • Figure 2: Accuracy on the 3-category lung adenocarcinoma classification task (DHMC dataset). All baselines use the ViT-B/16 encoder from CLIP, with mean accuracy and standard deviation (error bars) shown across 10 runs.
  • Figure 3: SLIP similarity scores on Lung WSIs. (a) Heatmaps for solid pattern adenocarcinoma. (b) Patches with highest (top) and lowest (bottom) similarity scores.
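
The dual similarity pooling described in the Figure 1 caption can be sketched as follows. The caption names the two similarity matrices and the output $\mathbf{F}_\text{wsi}$ but not how they are combined, so the combination below (a matrix product followed by a softmax over patches) is an assumption made for illustration, as are the function names and the temperature value.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_similarity_pool(patch_feats, tissue_desc_embeds,
                         tissue_name_embeds, class_embeds, temperature=0.07):
    """Hypothetical sketch of tissue-guided pooling.

    patch_feats:        (N, D) patch image embeddings.
    tissue_desc_embeds: (T, D) text embeddings of tissue descriptions.
    tissue_name_embeds: (T, D) text embeddings of tissue type names.
    class_embeds:       (C, D) text embeddings of WSI class names.
    Returns F_wsi with shape (C, D) and the patch weights with shape (N, C).
    """
    # S_tissue^patch: how strongly each patch resembles each tissue type.
    s_patch = l2_normalize(patch_feats) @ l2_normalize(tissue_desc_embeds).T   # (N, T)
    # S_tissue^wsi: how relevant each tissue type is to each WSI class.
    s_wsi = l2_normalize(class_embeds) @ l2_normalize(tissue_name_embeds).T    # (C, T)

    # Assumed combination: route patch relevance to classes through tissues,
    # then normalize over patches so each class has weights summing to 1.
    relevance = s_patch @ s_wsi.T                        # (N, C)
    weights = softmax(relevance / temperature, axis=0)   # (N, C)

    # Class-specific slide features as weighted sums of patch features.
    f_wsi = weights.T @ patch_feats                      # (C, D)
    return f_wsi, weights
```

Under these assumptions, the returned `weights` also give a per-class patch attribution, which is the kind of quantity that heatmaps such as those in Figure 3 visualize.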