ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification

Jiangbo Shi, Chen Li, Tieliang Gong, Yefeng Zheng, Huazhu Fu

TL;DR

ViLa-MIL addresses the challenge of few-shot WSI classification by integrating a frozen large language model to generate dual-scale visual descriptive prompts that guide a CLIP-based vision-language framework. It introduces a prototype-guided patch decoder to progressively fuse patch features and a context-guided text decoder to refine text features using multi-granular image context. The approach delivers state-of-the-art performance on three multi-center cancer subtyping datasets under few-shot conditions, with ablations isolating the contribution of each component. The work highlights the practical potential of injecting language priors to efficiently transfer large pre-trained models to digital pathology and improve generalization across centers.

Abstract

Multiple instance learning (MIL)-based frameworks have become the mainstream approach in digital pathology for processing whole slide images (WSIs), which have giga-pixel size and hierarchical image context. However, these methods depend heavily on a substantial number of bag-level labels and learn solely from the original slides, which makes them sensitive to variations in data distribution. Recently, vision-language model (VLM)-based methods have introduced a language prior by pre-training on large-scale pathological image-text pairs. However, previous text prompts do not take pathological prior knowledge into account and therefore do not substantially boost the model's performance. Moreover, collecting such pairs and running the pre-training process are very time-consuming and resource-intensive. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to boost the performance of the VLM effectively. To transfer the VLM to WSI processing efficiently, for the image branch, we propose a prototype-guided patch decoder that aggregates the patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder that enhances the text features by incorporating multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
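
The idea of grouping similar patches into prototypes can be illustrated with a minimal sketch. This is not the authors' decoder: it assumes plain softmax attention with no learned projections, and the names `aggregate_patches` and `steps`, as well as the toy features, are hypothetical. Each prototype repeatedly attends over all patch features and is replaced by the attention-weighted average, so patches that resemble a prototype dominate its update.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate_patches(patches, prototypes, steps=2):
    """Softly group patch features into prototypes (assumed simplification).

    patches:    list of N feature vectors, each of length D
    prototypes: list of P initial prototype vectors, each of length D
    Returns the updated prototypes and their mean as a slide-level feature.
    """
    dim = len(patches[0])
    for _ in range(steps):
        updated = []
        for proto in prototypes:
            # Attention weights of this prototype over all patches.
            weights = softmax([dot(proto, p) / math.sqrt(dim) for p in patches])
            # New prototype = weighted average of the patch features.
            updated.append(
                [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
            )
        prototypes = updated
    # Assumption: the slide feature is the plain mean of the prototypes.
    slide = [sum(p[d] for p in prototypes) / len(prototypes) for d in range(dim)]
    return prototypes, slide


if __name__ == "__main__":
    toy_patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    toy_prototypes = [[1.0, 0.0], [0.0, 1.0]]
    protos, slide_feature = aggregate_patches(toy_patches, toy_prototypes)
    print(protos, slide_feature)
```

Because the number of prototypes is much smaller than the number of patches, the slide is summarised by a handful of vectors before any comparison with text features. The actual decoder presumably uses learned parameters that this sketch omits.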

Paper Structure

This paper contains 26 sections, 9 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Comparison of our ViLa-MIL with existing MIL- and VLM-based methods. (a) MIL-based methods design various aggregation functions to generate the slide-level features; (b) VLM-based methods calculate the similarity between patches and candidate text prompts, then use an operator such as top-K to obtain the slide-level prediction; (c) Our ViLa-MIL aligns the dual-scale slide-level features with the visual descriptive text prompt to obtain the slide prediction efficiently. Note that, for simplicity, only the single-scale data stream of ViLa-MIL is visualized.
  • Figure 2: Pipeline of the proposed ViLa-MIL framework. The inputs to ViLa-MIL are a question and a WSI. The question is passed through a frozen large language model (LLM) to generate the dual-scale visual descriptive text prompt. The prototype-guided patch decoder progressively fuses the patch features into the slide features. The context-guided text decoder further refines the text features by using the multi-granular image contexts.
  • Figure 3: (a) Prototype-guided patch decoder; (b) Context-guided text decoder.
  • Figure 4: Slide-level feature clustering results of different methods on the TCGA-RCC (top) and TCGA-Lung (bottom) datasets.
  • Figure 5: Interpretability analysis (yellow for cancer) of several exemplars from the TIHD-RCC and TCGA-RCC datasets.
  • ...and 5 more figures
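
The top-K scheme that the Figure 1 caption attributes to earlier VLM-based methods can be sketched as follows. This is an illustrative assumption rather than code from the paper: the function names, the choice of cosine similarity, and the mean over the top-K values are all hypothetical, and the feature vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def topk_slide_scores(patch_feats, text_feats, k):
    """Score each class by the mean of its K highest patch-to-prompt similarities."""
    scores = []
    for text in text_feats:
        sims = sorted((cosine(p, text) for p in patch_feats), reverse=True)
        top = sims[:k]
        scores.append(sum(top) / len(top))
    return scores


def predict(patch_feats, text_feats, k=2):
    """Return the index of the best-scoring class and all class scores."""
    scores = topk_slide_scores(patch_feats, text_feats, k)
    label = max(range(len(scores)), key=scores.__getitem__)
    return label, scores


if __name__ == "__main__":
    # Toy 2-D features: four patches and one text prompt per class.
    toy_patches = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]
    toy_prompts = [[1.0, 0.0], [0.0, 1.0]]
    print(predict(toy_patches, toy_prompts, k=2))
```

In this scheme every patch is compared against every prompt. The approach in panel (c) instead compares slide-level features with the prompts, which is consistent with the caption's claim of efficiency.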