Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning

Harrison Fuller; Fernando Gabriela Garcia; Victor Flores

TL;DR

This work introduces HiCA, a two-stage Adaptive Vision-Language Fine-tuning framework for medical imaging that combines domain-specific pretraining with hierarchical contrastive alignment across global, local, and cross-category levels. By formalizing a multi-term loss $\mathcal{L}_{\text{HiCA}}$ and incorporating ROI-based local alignment, HiCA achieves superior few-shot and zero-shot performance on Chest X-ray and Breast Ultrasound datasets, outperforming CLIP-based and conventional supervised baselines. Ablation studies confirm the necessity of each hierarchical component, while human evaluation with radiologists demonstrates improved interpretability and clinical relevance. The method also shows robust generalization to unseen categories and resilience to noisy descriptors, with practical efficiency suitable for real-world deployment. Overall, HiCA advances LVLM adaptation to medical imaging by enabling fine-grained, domain-aware vision-language alignment that improves diagnostic support under limited labeled data.

Abstract

Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.
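The idea of aligning visual and textual representations at global, local, and cross-category levels can be illustrated with a small sketch. The paper's actual definition of $\mathcal{L}_{\text{HiCA}}$ is not reproduced on this page, so everything below is an assumption made for illustration: the CLIP-style symmetric InfoNCE form of the two alignment terms, the hinge form of the separation term, and the names `symmetric_alignment`, `category_separation`, `hierarchical_loss`, `temperature`, `margin`, and `weights` are all hypothetical and may differ from the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, targets, temperature=0.1):
    """Mean InfoNCE loss in which anchors[i] should match targets[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidate targets.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_alignment(image_embs, text_embs, temperature=0.1):
    """Average of image-to-text and text-to-image InfoNCE (assumed form)."""
    return 0.5 * (info_nce(image_embs, text_embs, temperature)
                  + info_nce(text_embs, image_embs, temperature))


def category_separation(class_text_embs, margin=0.5):
    """Hinge penalty on class descriptors whose similarity exceeds `margin`.

    Hypothetical stand-in for a cross-category separation term.
    """
    penalties = []
    for i in range(len(class_text_embs)):
        for j in range(i + 1, len(class_text_embs)):
            sim = cosine(class_text_embs[i], class_text_embs[j])
            penalties.append(max(0.0, sim - margin))
    return sum(penalties) / len(penalties) if penalties else 0.0


def hierarchical_loss(global_img, global_txt, roi_img, roi_txt, class_txt,
                      weights=(1.0, 1.0, 1.0)):
    """Weighted sum of global, local (ROI), and cross-category terms.

    The weights are illustrative placeholders, not values from the paper.
    """
    w_global, w_local, w_cross = weights
    return (w_global * symmetric_alignment(global_img, global_txt)
            + w_local * symmetric_alignment(roi_img, roi_txt)
            + w_cross * category_separation(class_txt))
```

In this sketch, matched image and text embeddings yield a smaller alignment term than mismatched ones, and the separation term is zero when class descriptors are already at least `margin` apart in cosine similarity. Whole-image embeddings would feed the global term and region-of-interest embeddings the local term.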

Paper Structure

This paper contains 21 sections, 5 equations, and 6 tables.

Introduction
Related Work
Medical Image Classification
Medical Large Vision-Language Models
Method
Problem Formulation
Hierarchical Contrastive Learning
Global Alignment Loss
Local Alignment Loss
Cross-Category Separation Loss
Training Procedure
Experiments
Experimental Setup
Comparison with State-of-the-Art Methods
Effectiveness of Hierarchical Contrastive Learning
...and 6 more sections
