Boosting Medical Visual Understanding From Multi-Granular Language Learning

Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan

TL;DR

This work introduces Multi-Granular Language Learning (MGLL), a plug-and-play contrastive framework designed for medical visual understanding that achieves simultaneous multi-label and cross-granularity alignment. MGLL extends CLIP with soft-label supervision (soft CLIP loss), a point-wise per-pair loss, and a smooth KL divergence loss to ensure cross-granularity consistency, without adding extra encoders. It builds two large multi-granular datasets, MGLL-Fundus and MGLL-Xray, enabling robust pretraining on fundus and chest-imaging modalities. Empirical results across eleven downstream datasets show MGLL consistently outperforms state-of-the-art baselines, improves performance of multimodal large language models, and demonstrates robustness to image/text quality variations. The work provides detailed ablations, theoretical insights into the optimality of the MGLL losses, and reproducibility resources, underscoring MGLL’s potential to advance medical vision-language pretraining and generalization to other hierarchical domains.

Abstract

Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
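
As a rough illustration of the three loss terms named above (soft-label CLIP loss, point-wise loss, smooth KL divergence), the following NumPy sketch shows one plausible form of each. This summary does not give the exact formulations, so the function names, the row-normalized soft targets, the sigmoid-based point-wise term, and the use of the mean distribution as the KL reference are all assumptions made for illustration. It is not the authors' implementation; the linked repository is the authoritative source.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def soft_clip_loss(logits, targets):
    """Cross-entropy between row-wise softmax(logits) and soft targets.

    logits:  (n_images, n_texts) image-text similarity scores.
    targets: (n_images, n_texts) non-negative match matrix; each row must
             contain at least one positive entry (assumption of this sketch).
    """
    t = targets / targets.sum(axis=1, keepdims=True)  # rows sum to 1
    return float(-(t * log_softmax(logits, axis=1)).sum(axis=1).mean())


def pointwise_loss(logits, targets):
    """Binary cross-entropy applied independently to every image-text pair."""
    # Stable form of -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))].
    return float((np.logaddexp(0.0, logits) - targets * logits).mean())


def smooth_kl_loss(logits_per_granularity):
    """Average KL(p_g || p_mean) over granularities g and images.

    p_g is the softmax over texts at granularity g; p_mean is the average
    of those distributions, used here as a hypothetical smooth reference.
    """
    log_p = np.stack([log_softmax(l, axis=1) for l in logits_per_granularity])
    p = np.exp(log_p)
    mean_p = p.mean(axis=0, keepdims=True)
    kl = (p * (log_p - np.log(mean_p))).sum(axis=-1)
    return float(kl.mean())


# Toy example: 3 images, 4 label texts, two annotation granularities.
rng = np.random.default_rng(0)
coarse_logits = rng.normal(size=(3, 4))
fine_logits = rng.normal(size=(3, 4))
multi_label = np.array([[1, 1, 0, 0],
                        [0, 1, 0, 0],
                        [0, 0, 1, 1]], dtype=float)

total = (soft_clip_loss(coarse_logits, multi_label)
         + soft_clip_loss(fine_logits, multi_label)
         + pointwise_loss(coarse_logits, multi_label)
         + pointwise_loss(fine_logits, multi_label)
         + smooth_kl_loss([coarse_logits, fine_logits]))
print("toy combined loss:", total)
```

In this sketch the terms are summed with equal weight purely for simplicity; how the terms are actually weighted is not stated in this summary. The row-normalized targets are what allow an image to match several texts at once, and the KL term is zero exactly when all granularities produce the same distribution over texts.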

Paper Structure

This paper contains 49 sections, 48 equations, 6 figures, 38 tables.

Figures (6)

  • Figure 1: The illustrative comparison of input and outcome between CLIP and MGLL.
  • Figure 2: The overview of MGLL (Multi-Granular Language Learning) pretraining pipeline.
  • Figure 3: The quantitative comparison (AUC) between baseline methods and proposed MGLL on nine fundus downstream datasets.
  • Figure 4: The Class Activation Maps of different diseases from CLIP and MGLL.
  • Figure 5: Case Studies (Top: Case 1, Bottom: Case 2) Demonstrating MGLL Integration Impact on Diagnostic Accuracy of Different Multimodal Large Language Models (MLLMs).
  • ...and 1 more figure