ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models

M. Arda Aydın, Melih B. Yilmaz, Aykut Koç, Tolga Çukur

Abstract

The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture domain-specific details but generalize poorly, and generalist medical VLMs trained on multi-domain data, which retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods that overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss to effectively suppress false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.
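As a rough illustration of the low-rank adaptation idea mentioned in the abstract, the following minimal NumPy sketch adds a trainable low-rank update to a frozen linear projection. It follows the standard LoRA formulation (frozen weight plus a scaled product of two small matrices), not the authors' code; the class name `LoRALinear` and the values of `rank` and `alpha` are assumptions chosen for illustration only.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (hypothetical sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight  # frozen pretrained weight, never updated
        # Trainable factors: A starts small and random, B starts at zero,
        # so the adapted layer initially matches the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank  # assumed scaling, as in standard LoRA

    def __call__(self, x):
        frozen_out = x @ self.weight.T
        low_rank_out = (x @ self.A.T) @ self.B.T
        return frozen_out + self.scale * low_rank_out

    def num_trainable(self):
        # Only A and B are trained: rank * (d_in + d_out) parameters.
        return self.A.size + self.B.size
```

With a 512-by-512 projection and rank 4, this layer trains 4,096 parameters instead of 262,144, which is the kind of saving that keeps the overall trainable budget small.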

Paper Structure

This paper contains 24 sections, 13 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Overview of ACE-LoRA. ACE-LoRA integrates low-rank adaptation modules into self-attention blocks of image and text encoders in a generalist medical VLM, and introduces ACE-HGNN, a hypergraph-based module that models high-order topological dependencies between local (e.g., image patches or report snippets) and global embeddings. For clarity, ACE-HGNN is described using image embeddings, though the same procedure is applied to text embeddings.
  • Figure 2: False negatives in contrastive learning. The CLIP loss treats all non-matching pairs as negatives, which can falsely push apart semantically similar samples, whereas our formulation avoids separating pairs that share the same disease label.
  • Figure 3: Comparison of cross-modal similarity maps on the RSNA Dataset. Similarity maps show the correspondence between image regions and the text query "Pneumonia", with red boxes indicating ground-truth (GT) abnormal regions.
  • Figure A.1: Average zero-shot accuracy across three CXR benchmarks vs. number of trainable parameters (log-scale). The bubble size denotes the computational cost in GFLOPs during the forward pass.
  • Figure A.2: Zero-shot accuracy across three datasets for varying $k$ values. We find that selecting $k=5$ yields optimal performance.
  • ...and 3 more figures
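
The false-negative issue described in the Figure 2 caption can be sketched in code. The NumPy example below is one possible reading of a label-guided InfoNCE loss, not the paper's exact formulation: non-matching image-text pairs that share a disease label are dropped from the softmax denominator, so they are no longer pushed apart. The function name `label_guided_info_nce`, the `temperature` value, and the masking strategy are assumptions made for illustration.

```python
import numpy as np


def label_guided_info_nce(img, txt, labels, temperature=0.07):
    """Symmetric contrastive loss that ignores same-label negatives (sketch)."""
    # L2-normalize so the dot product is cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    labels = np.asarray(labels)
    same_label = labels[:, None] == labels[None, :]
    is_pair = np.eye(len(labels), dtype=bool)
    # Off-diagonal pairs with the same label are treated as false negatives.
    false_negative = same_label & ~is_pair
    masked = np.where(false_negative, -np.inf, logits)

    def cross_entropy_on_diagonal(m):
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        log_prob = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(masked)
                  + cross_entropy_on_diagonal(masked.T))
```

When every sample in the batch has a distinct label, no pair is masked and the function reduces to the standard CLIP-style loss; when labels repeat, the masked loss can only be lower or equal, since terms are removed from the denominator.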