PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning

Qifeng Zhou; Wenliang Zhong; Yuzhi Guo; Michael Xiao; Hehuan Ma; Junzhou Huang

TL;DR

PathM3 tackles the challenge of aligning gigapixel WSIs with scarce WSI-level captions by proposing a multimodal MIL framework that uses a frozen image encoder, a correlation module with Nyström attention to aggregate patch features, and a query-based transformer to fuse WSI visuals with captions. The model jointly optimizes classification and captioning via a multi-task objective, $L_{overall} = \alpha L_C + (1-\alpha) L_G$, leveraging limited captions through shared multimodal learning and a frozen LLM for generation. Empirical results on PatchGastric show state-of-the-art performance for both WSI classification (86.40% accuracy with image+text) and captioning metrics (BLEU@4 0.520, METEOR 0.394, SPICE 0.591), with ablations confirming the critical role of the correlation module and multi-task learning. These findings demonstrate data-efficient, interpretable multimodal histopathology analysis that better leverages WSI context and expert captions for diagnostic support.
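The multi-task objective above is a convex combination of two loss terms. The sketch below illustrates only that weighting. It assumes $L_C$ is the classification loss and $L_G$ the caption-generation loss, which is what the summary suggests. The function name `overall_loss` and the default `alpha = 0.5` are hypothetical and not taken from the paper.

```python
def overall_loss(loss_cls: float, loss_gen: float, alpha: float = 0.5) -> float:
    """Return alpha * L_C + (1 - alpha) * L_G.

    loss_cls: classification loss L_C (assumed meaning of the symbol)
    loss_gen: caption-generation loss L_G (assumed meaning of the symbol)
    alpha:    trade-off weight in [0, 1]; the default here is illustrative only
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_cls + (1.0 - alpha) * loss_gen


if __name__ == "__main__":
    # With alpha = 1 only the classification term contributes;
    # with alpha = 0 only the generation term contributes.
    print(overall_loss(2.0, 4.0, alpha=0.25))
```

Because the two weights sum to one, changing `alpha` shifts emphasis between the two tasks without changing the overall scale of the loss.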

Abstract

In the field of computational histopathology, both whole slide images (WSIs) and diagnostic captions provide valuable insights for making diagnostic decisions. However, aligning WSIs with diagnostic captions presents a significant challenge. This difficulty arises from two main factors: 1) gigapixel WSIs are unsuitable for direct input into deep learning models, and the redundancy and correlation among the patches demand more attention; and 2) authentic WSI diagnostic captions are extremely limited, making it difficult to train an effective model. To overcome these obstacles, we present PathM3, a multimodal, multi-task, multiple instance learning (MIL) framework for WSI classification and captioning. PathM3 adapts a query-based transformer to effectively align WSIs with diagnostic captions. Given that histopathology visual patterns are redundantly distributed across WSIs, we aggregate the patch features with an MIL method that considers the correlations among instances. Furthermore, PathM3 overcomes data scarcity in WSI-level captions by leveraging the limited WSI diagnostic caption data through multi-task joint learning. Extensive experiments show improved classification accuracy and caption generation, demonstrating the effectiveness of our method on both the WSI classification and captioning tasks.

Paper Structure (14 sections, 4 equations, 2 figures, 6 tables)

This paper contains 14 sections, 4 equations, 2 figures, 6 tables.

Introduction
Related Work
Method
Problem Formulation
Correlation of each instance
WSI and Caption Fusion
Multi-task Joint Learning
Experiments and Results
Dataset
Comparison with state-of-the-art methods
Ablational studies
Conclusion
Acknowledgements.
Disclosure of Interests.

Figures (2)

Figure 1: PathM3 overview. A WSI is fed into a frozen image encoder to generate image embeddings. These embeddings then pass through a correlation module before being fed into a query-based transformer, in which the learnable query embeddings interact with the textual embeddings using self-attention and with image embeddings using cross-attention. The outputs of these queries are then utilized for classification via a linear classifier and for generating captions with a frozen LLM.
Figure 2: Visualization of high attention score patches of each subtype. For each subtype, the top 5 patches with the highest attention scores are chosen. A board-certified pathologist confirms that PathM3 selects relevant morphological patterns for each subtype with high attention scores.
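The two captions describe queries attending to patch embeddings and the selection of the five highest-scoring patches. The sketch below illustrates that idea in a deliberately simplified form: a single query vector, plain scaled dot-product softmax attention, and a top-k selection over the resulting weights. It is not the authors' implementation. PathM3 uses a Nyström-attention correlation module and a query-based transformer with multiple learnable queries, neither of which is reproduced here. All function names are hypothetical.

```python
import math


def attention_weights(query, patches):
    """Softmax over scaled dot products between one query and each patch embedding."""
    scale = math.sqrt(len(query))
    scores = [sum(q * p for q, p in zip(query, patch)) / scale for patch in patches]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool(weights, patches):
    """Attention-weighted average of patch embeddings (a slide-level feature)."""
    dim = len(patches[0])
    return [sum(w * patch[i] for w, patch in zip(weights, patches)) for i in range(dim)]


def top_k_patches(weights, k=5):
    """Indices of the k patches with the highest attention weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "patch embeddings"; real features would come from an image encoder.
    patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    query = [1.0, 0.0]
    weights = attention_weights(query, patches)
    print(top_k_patches(weights, k=2), pool(weights, patches))
```

In this toy setting, the patch most aligned with the query receives the largest weight, which is the sense in which high-attention patches can be inspected for the morphological patterns they contain.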
