Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

Shahad Albastaki, Anabia Sohail, Iyyakutti Iyappan Ganapathi, Basit Alawode, Asim Khan, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood

TL;DR

This work introduces MR-PLIP, a multi-resolution vision-language pre-training framework for computational pathology that learns text-guided visual representations across multiple magnifications. It introduces two novel cross-resolution learning components: CVTA, which aligns visual patches with a keyword bag derived from multi-resolution captions, and MRTVA, which enforces consistency between text-guided visual features across parent-child resolution pairs via a SimSiam-inspired objective. Pre-trained on 34 million image-text pairs from TCGA spanning four magnifications, MR-PLIP achieves state-of-the-art performance on 26 histopathology benchmarks across tile- and WSI-level classification, segmentation, nuclei segmentation, and cross-modal retrieval, including strong zero-shot and linear-probe results. The findings highlight the value of explicitly modeling multi-scale context in CPath and suggest broad potential for extending the approach to other medical imaging modalities and diagnostic tasks.
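
As a rough illustration of the SimSiam-inspired consistency idea behind MRTVA, the sketch below computes the standard SimSiam symmetric negative cosine similarity between a parent-resolution feature and a child-resolution feature. The function names, the predictor outputs `p_*`, and the target features `z_*` are illustrative assumptions; the exact MRTVA loss is not stated in this summary and may differ.

```python
import math
from typing import Sequence


def negative_cosine(p: Sequence[float], z: Sequence[float]) -> float:
    """Negative cosine similarity between a prediction p and a target z."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def symmetric_consistency_loss(
    p_parent: Sequence[float],
    z_parent: Sequence[float],
    p_child: Sequence[float],
    z_child: Sequence[float],
) -> float:
    """SimSiam-style symmetric loss over one parent-child resolution pair.

    In SimSiam the targets z_* are treated as constants (stop-gradient).
    Plain Python has no autograd, so that step is only noted here.
    """
    return 0.5 * negative_cosine(p_parent, z_child) + 0.5 * negative_cosine(
        p_child, z_parent
    )


# Hypothetical 2-D features: aligned pairs reach the minimum value of -1.
aligned = symmetric_consistency_loss([1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0])
# Orthogonal parent and child features give a loss of 0.
orthogonal = symmetric_consistency_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

The loss depends only on direction, not magnitude, so parent and child features are pulled toward agreement regardless of their scale.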

Abstract

In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, but existing work focuses primarily on aligning image-text pairs at a single magnification level. This approach may be insufficient for tasks such as cancer subtype classification, tissue phenotyping, and survival analysis, because a single-resolution image provides a limited level of detail. To address this, we propose a novel multi-resolution paradigm that leverages Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generates corresponding textual descriptions using an advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model's ability to capture context from multiple resolutions in histology images. Supported by novel loss functions, our model aims to capture a broader range of information, enrich feature representation, improve discriminative ability, and enhance generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: https://github.com/BasitAlawode/MR-PLIP

Paper Structure

This paper contains 49 sections, 11 equations, 9 figures, 17 tables.

Figures (9)

  • Figure 1: (a) Existing CPath VLMs huang2023visual, lu2023visual. (b) Our proposed Multi-Resolution Pathology-Language Pre-training (MR-PLIP) model. (c) Zero-shot classification performance comparison of MR-PLIP and SOTA VLMs in terms of weighted average $F_{1}$ score on publicly available benchmark datasets. MR-PLIP outperforms existing SOTA methods.
  • Figure 2: Analysis of a multi-resolution input WSI (a) using Quilt-LLaVA seyfioglu2024quilt. Exemplar histology patches (b)-(e) are shown at different magnifications, demonstrating how increasing magnification (from 5$\times$ to 40$\times$) shifts the focus from contextual to detailed information. The textual descriptions generated by Quilt-LLaVA vary accordingly, reflecting the change in detail observed at each magnification level.
  • Figure 3: Comparison of zero-shot tile-based classification balanced accuracy for QuiltNet ikezogwo2024quilt and our proposed MR-PLIP across seven independent datasets. Both models are pre-trained on the TCGA dataset using 34M image-text pairs, and the comparison evaluates the impact of fine-tuning the language encoders at varying magnification levels: 5$\times$, 10$\times$, 20$\times$, and 40$\times$. The variations in performance across these magnifications highlight the necessity for a multi-resolution VLM in CPath to enhance generalization capabilities.
  • Figure 4: Outline of the proposed MR-PLIP algorithm workflow. In step (a), multi-resolution histology patches are extracted. In step (b), these patches are compiled into a visual bag for each resolution. In step (c), textual descriptions of the patches are extracted using Quilt-LLaVA seyfioglu2024quilt. Steps (d)-(f) generate embeddings for the visual and textual data via a visual encoder (UNI chen2024towards) and a text encoder (QuiltNet ikezogwo2024quilt), respectively. In step (g), the similarity between the visual and textual embeddings is used to identify the top-$k_{o}$ positive words for each patch. Lastly, steps (h)-(i) process these top-$k_{o}$ words and the visual features through a multi-modal encoder designed to uncover spatial interrelations in the multi-modal context.
  • Figure 5: Zero-shot tile-based classification accuracy of SOTA VL models, including (a) MI-Zero lu2023visual, (b) QuiltNet ikezogwo2024quilt, (c) BiomedCLIP zhang2024biomedclip, and (d) PLIP huang2023visual, on the testing splits of seven independent datasets. All models are pre-trained on the TCGA dataset using patches from 20k WSIs. In each experiment, the pre-trained vision-language encoders are fine-tuned at a fixed magnification level: 5$\times$, 10$\times$, 20$\times$, or 40$\times$. Performance variations across magnification levels show the need for a multi-resolution VL model in computational pathology for improved generalization capabilities.
  • ...and 4 more figures
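
Step (g) of Figure 4 selects the top-$k_{o}$ words whose text embeddings are most similar to a patch's visual embedding. A minimal sketch of that selection is given below, assuming cosine similarity as the score. The helper names, the toy 2-D embeddings, and the example keywords are hypothetical; in the described pipeline the embeddings would come from the UNI and QuiltNet encoders, and the actual scoring may differ.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_words(
    patch_embedding: Sequence[float],
    word_embeddings: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the k words most similar to the patch, best match first."""
    ranked = sorted(
        word_embeddings.items(),
        key=lambda item: cosine(patch_embedding, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


# Toy keyword bag with made-up 2-D embeddings (illustration only).
keyword_bag = {
    "nuclei": [0.9, 0.1],
    "stroma": [0.1, 1.0],
    "necrosis": [-1.0, 0.0],
}
patch = [1.0, 0.2]
selected = top_k_words(patch, keyword_bag, k=2)
```

Here `selected` keeps the two words whose embeddings point most nearly in the same direction as the patch embedding; the remaining word is dropped from that patch's positive set.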