Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning

Harrison Fuller, Fernando Gabriela Garcia, Victor Flores

TL;DR

This work introduces HiCA, a two-stage Adaptive Vision-Language Fine-tuning framework for medical imaging that combines domain-specific pretraining with hierarchical contrastive alignment across global, local, and cross-category levels. By formalizing a multi-term loss $\mathcal{L}_{\text{HiCA}}$ and incorporating ROI-based local alignment, HiCA achieves superior few-shot and zero-shot performance on Chest X-ray and Breast Ultrasound datasets, outperforming CLIP-based and conventional supervised baselines. Ablation studies confirm the necessity of each hierarchical component, while human evaluation with radiologists demonstrates improved interpretability and clinical relevance. The method also shows robust generalization to unseen categories and resilience to noisy descriptors, with practical efficiency suitable for real-world deployment. Overall, HiCA advances LVLM adaptation to medical imaging by enabling fine-grained, domain-aware vision-language alignment that improves diagnostic support under limited labeled data.

Abstract

Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.

Paper Structure (21 sections, 5 equations, 6 tables)