Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification

Zhenfeng Zhuang, Fangyu Zhou, Liansheng Wang

TL;DR

Libra-MIL addresses few-shot whole-slide image classification under weak (bag-level) supervision by combining task-specific language priors generated by a frozen LLM with a bidirectional, prototype-based fusion framework. It learns both visual and text prototypes and fuses their similarities with Stereoscopic Optimal Transport (SOT) into a single embedding space for cross-modal reasoning. On three cancer datasets and across multiple shot settings, the authors report better generalization than state-of-the-art methods, along with prototype-based interpretability that highlights task-relevant histological features.

Abstract

While Large Language Models (LLMs) are emerging as a promising direction in computational pathology, the substantial computational cost of giga-pixel Whole Slide Images (WSIs) necessitates the use of Multi-Instance Learning (MIL) to enable effective modeling. A key challenge is that pathological tasks typically provide only bag-level labels, while instance-level descriptions generated by LLMs often suffer from bias due to a lack of fine-grained medical knowledge. To address this, we propose that constructing task-specific pathological entity prototypes is crucial for learning generalizable features and enhancing model interpretability. Furthermore, existing vision-language MIL methods often employ unidirectional guidance, limiting cross-modal synergy. In this paper, we introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction through a balanced information compression scheme. Specifically, we leverage a frozen LLM to generate task-specific pathological entity descriptions, which are learned as text prototypes. Concurrently, the vision branch learns instance-level prototypes to mitigate the model's reliance on redundant data. For the fusion stage, we employ the Stereoscopic Optimal Transport (SOT) algorithm, which is based on a similarity metric, thereby facilitating broader semantic alignment in a higher-dimensional space. We conduct few-shot classification and explainability experiments on three distinct cancer datasets, and the results demonstrate the superior generalization capabilities of our proposed method.
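To make the fusion idea more concrete, the following is a minimal sketch of entropy-regularised optimal transport (the standard Sinkhorn iteration) between a set of visual prototypes and a set of text prototypes, using cosine distance as the cost. It is not the paper's Stereoscopic Optimal Transport: the function names, the uniform marginals, the `eps` value, and the toy 2-D prototypes are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropy-regularised OT plan with uniform marginals (standard Sinkhorn)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each visual prototype (assumed uniform)
    b = [1.0 / m] * m  # mass on each text prototype (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy prototypes (made-up 2-D vectors, not values from the paper).
visual_prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_prototypes = [[1.0, 0.2], [0.1, 1.0]]

# Cost = cosine distance, so similar prototype pairs are cheap to match.
cost = [[1.0 - cosine(p, t) for t in text_prototypes] for p in visual_prototypes]
plan = sinkhorn_plan(cost)

for row in plan:
    print([round(x, 3) for x in row])
```

Each entry of `plan` is the amount of mass moved from one visual prototype to one text prototype. Rows sum to the visual marginal and columns to the text marginal, so the plan acts as a soft, balanced alignment between the two modalities rather than a one-directional attention map.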

Paper Structure

This paper contains 58 sections, 42 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Brief comparison with related MIL methods. a) Basic MIL simply performs aggregation and sorting operations. b) Prototype MIL learns class-aware logits directly from instance similarity. c) VLMIL fuses LLM priors with text-guided similarity. d) Libra-MIL uses multimodal prototype learning with Stereoscopic Optimal Transport (SOT) on similarities.
  • Figure 2: Overview of Libra-MIL. The framework first employs Vision Preprocessing and Task-specific Language Priors Generation modules to perform modality-specific preprocessing and prior embedding. The dual-prototype multimodal learner then integrates instance-level representations from both modalities into a unified similarity space through prototype-based modeling and Sinkhorn optimal transport.
  • Figure 3: Results of different LLMs' priors on TCGA-NSCLC under the 4-shot setting.
  • Figure 4: Case study of multimodal prototypes with different histological morphologies and semantic attention on WSIs.
  • Figure 5: Gradient-based contribution of textual and visual prototypes.
  • ...and 3 more figures
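
The prototype-based MIL idea named in the Figure 1 caption (bag-level scores derived from instance-to-prototype similarity) can be sketched as below. This is an illustrative assumption about how such pooling might look, not the formulation used by Libra-MIL or by any of the compared methods; `bag_prototype_scores`, the softmax pooling, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(values, temperature):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in values]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def bag_prototype_scores(instances, prototypes, temperature=0.1):
    """Pool instance-to-prototype similarities into one score per prototype."""
    scores = []
    for proto in prototypes:
        sims = [cosine(x, proto) for x in instances]
        # Instances most similar to the prototype dominate the pooled score.
        weights = softmax(sims, temperature)
        scores.append(sum(w * s for w, s in zip(weights, sims)))
    return scores


# A toy "bag" of patch features and two prototypes (made-up 2-D vectors).
bag = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.1]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]

scores = bag_prototype_scores(bag, prototypes)
print([round(s, 3) for s in scores])
```

Because only a bag-level label is available for a whole slide, the per-prototype scores stand in for instance labels: a classifier can be trained on `scores`, and the softmax weights show which patches drove each score.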