Can language-guided unsupervised adaptation improve medical image classification using unpaired images and texts?

Umaima Rahman; Raza Imam; Mohammad Yaqub; Boulbaba Ben Amor; Dwarikanath Mahapatra

TL;DR

Medical image classification is hampered by scarce labeled data, while disease-related text is more abundant. MedUnA presents a two-stage, language-guided unsupervised adaptation of Vision-Language Models that uses unpaired images and LLM-generated class descriptions to train a cross-modal adapter and a learnable prompt, then refines them with entropy-based unsupervised training. Key contributions include label-free tuning via textual descriptions, test-time unsupervised adaptation to align visual and textual embeddings, and substantial accuracy gains across five datasets with MedCLIP outperforming CLIP baselines. This approach enhances scalability and generalization, enabling effective classification for novel disease classes without requiring large paired image-text datasets.

Abstract

In medical image classification, supervised learning is challenging due to the scarcity of labeled medical images. To address this, we leverage the visual-textual alignment within Vision-Language Models (VLMs) to enable unsupervised learning of a medical image classifier. In this work, we propose Medical Unsupervised Adaptation (MedUnA) of VLMs, where the LLM-generated descriptions for each class are encoded into text embeddings and matched with class labels via a cross-modal adapter. This adapter attaches to a visual encoder of MedCLIP and aligns the visual embeddings through unsupervised learning, driven by a contrastive entropy-based loss and prompt tuning, thereby improving performance in scenarios where textual information is more abundant than labeled images, particularly in the healthcare domain. Unlike traditional VLMs, MedUnA uses unpaired images and text for learning representations and enhances the potential of VLMs beyond traditional constraints. We evaluate the performance on three chest X-ray datasets and two multi-class datasets (diabetic retinopathy and skin lesions), showing significant accuracy gains over the zero-shot baseline. Our code is available at https://github.com/rumaima/meduna.
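The two-stage idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the linear adapter, the synthetic embeddings, the plain Shannon-entropy objective, and every function name below are assumptions made for illustration, and prompt tuning and the real MedCLIP encoders are left out. Stage 1 fits an adapter on labeled text embeddings only; stage 2 lowers prediction entropy on unlabeled image embeddings that were drawn independently of the texts (unpaired).

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(probs):
    # Mean Shannon entropy of the predicted class distributions.
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

def make_unpaired_data(num_classes=3, dim=16, n_per_class=30, noise=0.1, seed=0):
    # Synthetic stand-in for encoder outputs: both modalities scatter around
    # shared class prototypes but are sampled independently (unpaired).
    rng = np.random.default_rng(seed)
    protos = np.eye(num_classes, dim)
    labels = np.repeat(np.arange(num_classes), n_per_class)
    text = protos[labels] + noise * rng.standard_normal((len(labels), dim))
    image = protos[labels] + noise * rng.standard_normal((len(labels), dim))
    return text, image, labels  # image labels are used for evaluation only

def pretrain_adapter(text_emb, labels, num_classes, lr=0.5, steps=200, seed=0):
    # Stage 1 (hypothetical): fit a linear adapter W with cross-entropy,
    # using text embeddings whose class is known from the generating prompt.
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((text_emb.shape[1], num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        probs = softmax(text_emb @ W)
        W -= lr * text_emb.T @ (probs - onehot) / len(labels)
    return W

def entropy_refine(W, image_emb, lr=0.1, steps=20):
    # Stage 2 (hypothetical): gradient descent on mean prediction entropy
    # over unlabeled image embeddings; no image labels are touched.
    W = W.copy()
    for _ in range(steps):
        p = softmax(image_emb @ W)
        h = -(p * np.log(p + 1e-12)).sum(axis=-1, keepdims=True)
        grad_logits = -p * (np.log(p + 1e-12) + h)  # dH/dz for softmax outputs
        W -= lr * image_emb.T @ grad_logits / len(image_emb)
    return W

if __name__ == "__main__":
    text, image, labels = make_unpaired_data()
    W0 = pretrain_adapter(text, labels, num_classes=3)
    W1 = entropy_refine(W0, image)
    acc = (softmax(image @ W1).argmax(axis=1) == labels).mean()
    print("entropy before:", mean_entropy(softmax(image @ W0)))
    print("entropy after: ", mean_entropy(softmax(image @ W1)))
    print("image accuracy:", acc)
```

The design point the sketch tries to convey is that class supervision enters only through text, so the image side needs no labels. Entropy minimization on its own can collapse onto a few classes on real data, which is one reason a method would pair it with additional constraints.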

Paper Structure (10 sections, 5 equations, 5 figures, 3 tables)

Introduction
Methodology
Adapter Pre-training
Unsupervised Training
Inference
Experiments and Results
Experimentation Details
Results and Discussion
Conclusion
Compliance with Ethical Standards

Figures (5)

Figure 1: The MedUnA framework: (a) Adapter Pre-training: A textual classifier is trained to classify the LLM-generated descriptions for a disease. (b) Unsupervised Training: The trained textual classifier and a learnable prompt vector are trained on unpaired image embeddings in an unsupervised regime. (c) Inference: The tuned textual classifier and the prompt vector are used to get predictions on the test dataset.
Figure 2: Text classifier accuracy across different datasets when descriptions are generated by different LLMs.
Figure 3: Tuning improvement with MedCLIP-Swin zero-shot as the benchmark. The dashed line (----) denotes zero-shot CLIP.
Figure 4: Comparison of top-$k$ confident matches between the unpaired textual and visual embeddings for the IDRID dataset.
Figure 5: t-SNE plot of zero-shot MedCLIP vs. our method MedUnA for the ISIC dataset, which has 7 classes.
