Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification

Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge

Abstract

Whole Slide Images (WSIs) are gigapixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and the inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we make two key contributions. First, we introduce a new parameter-efficient prompt tuning method that scales and shifts features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy that does not rely on hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8%, respectively, over state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both the breast and lung cancer datasets, and by 5.8% on the ovarian cancer dataset, while also excelling at weakly supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.

Paper Structure (15 sections, 7 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: Illustration of the HIPSS framework. (a): Two text descriptions are generated, one for the entire WSI and one for the regions within the WSI. Task-specific parameters $\gamma$ and $\beta$ are attached to selected layers of the text encoder backbone, up to a pre-defined depth. (b): WSI representation learning using two attention pooling mechanisms with textual guidance. The region encoder creates each region embedding from the instances in that region. The WSI encoder creates the WSI embedding by aggregating the region embeddings. Attention weights are refined based on the cosine similarity between instance (or region) embeddings and the text embeddings. A contrastive-learning-based loss is calculated between the image and text embeddings.
  • Figure 2: Variation of AUC values based on the number of tuned blocks $d_{s}$ in the text encoder with SSF. Mean AUC values are reported. The $d_{s}$ values with the highest AUC are denoted by red dots.
  • Figure 3: Variation of AUC values based on (a): the attention weights refinement factor ($\lambda$) and (b): threshold value ($\alpha$). We report the mean AUC averaged across three datasets. The $\lambda$ and $\alpha$ values with the highest AUC are denoted by red dots.
  • Figure 4: Attention map visualizations from HIPSS compared with the ground-truth annotation. (b): High-attention regions predicted by the WSI encoder. (c): High-attention instances predicted by the region encoder. Attention maps are overlaid, with red indicating instances of higher attention. Quantitative results in Table \ref{tab:exp.segmentation} further demonstrate the tumor localization capabilities of our method.
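
The two mechanisms named in the Figure 1 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`ssf`, `text_guided_pool`), the additive form of the refinement (`logit + lam * cosine`), and the omission of the threshold $\alpha$ are all assumptions made for illustration. Only the scale-and-shift form $y = \gamma \odot x + \beta$ and the use of cosine similarity with a refinement factor $\lambda$ are taken from the captions above.

```python
import math


def ssf(features, gamma, beta):
    """Scale and shift each feature channel: y = gamma * x + beta.

    features: list of token vectors; gamma, beta: per-channel parameters.
    """
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in features]


def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_guided_pool(instances, attn_logits, text_emb, lam):
    """Attention pooling with a soft, text-guided refinement.

    ASSUMPTION: the refinement is modelled as adding lam * cosine similarity
    to each attention logit before the softmax. No instance is discarded;
    poorly aligned instances are only down-weighted.
    """
    refined = [a + lam * cosine(x, text_emb) for x, a in zip(instances, attn_logits)]
    weights = softmax(refined)
    dim = len(instances[0])
    pooled = [sum(w * x[d] for w, x in zip(weights, instances)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    # Toy text-encoder features modulated by hypothetical gamma and beta.
    print(ssf([[1.0, 2.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0]))

    # Two toy instance embeddings; the first is aligned with the text embedding.
    pooled, weights = text_guided_pool(
        instances=[[1.0, 0.0], [0.0, 1.0]],
        attn_logits=[0.0, 0.0],
        text_emb=[1.0, 0.0],
        lam=1.0,
    )
    print(pooled, weights)
```

With $\gamma = 1$ and $\beta = 0$ the `ssf` function is the identity, so the frozen encoder's features pass through unchanged. Setting `lam` to 0 in this sketch recovers plain attention pooling. The same pooling function could be applied first to the instances within a region and then to the resulting region embeddings, mirroring the region encoder and WSI encoder described in the caption.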