Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Fenglong Ma

TL;DR

This work tackles hallucination and attention misalignment in medical vision-language models by introducing A$^3$Tune, a training-time framework that automatically aligns attention to diagnostically critical regions. It combines prompt-aware weak labels generated from SAM and refined with BioMedCLIP, selective tuning of visually-critical attention heads, and a novel A$^3$MoE mechanism to adapt parameters across prompts and images. The approach is coupled with a mask-based attention objective and a final objective that balances language modeling with alignment, yielding superior performance on medical VQA and report generation benchmarks and improved visual grounding. The method demonstrates strong generalization across datasets and models, suggesting a practical path toward more reliable and interpretable Med-LVLMs in clinical settings.

Abstract

Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^3$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^3$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^3$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^3$Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
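
The alignment idea summarized in the TL;DR (a mask-based attention objective balanced against a language-modeling loss) can be illustrated with a minimal sketch. This is not the paper's formulation. The loss form (negative log of the attention mass that falls inside the weak-label mask), the weighting factor `lam`, and the function names `attention_alignment_loss` and `total_objective` are all assumptions chosen for illustration.

```python
import numpy as np


def attention_alignment_loss(attn, mask, eps=1e-8):
    """Hypothetical mask-based alignment loss (an assumption, not the paper's).

    attn: array of shape (heads, patches), non-negative attention weights
          from a text token to image patches.
    mask: array of shape (patches,), 1 inside the weak-label RoI, 0 outside.
    Returns the mean over heads of -log(share of attention inside the mask).
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(mask, dtype=float)
    total = attn.sum(axis=-1) + eps           # attention mass per head
    inside = (attn * mask).sum(axis=-1)       # mass that lands inside the RoI
    share = inside / total
    return float(np.mean(-np.log(share + eps)))


def total_objective(lm_loss, attn, mask, lam=0.5):
    """Assumed combination: language-modeling loss plus weighted alignment loss."""
    return lm_loss + lam * attention_alignment_loss(attn, mask)


if __name__ == "__main__":
    roi = np.array([1, 1, 0, 0])
    focused = np.array([[0.5, 0.5, 0.0, 0.0]])    # all mass inside the RoI
    diffuse = np.array([[0.25, 0.25, 0.25, 0.25]])  # half the mass inside
    print(attention_alignment_loss(focused, roi))
    print(attention_alignment_loss(diffuse, roi))
    print(total_objective(1.0, diffuse, roi, lam=0.5))
```

Under these assumptions, a head whose attention lies entirely inside the mask incurs a loss near zero, while a head that spreads attention uniformly over an image half covered by the mask incurs a loss near $\ln 2$. In a setup like the one the abstract describes, such a term would be applied only to the selected visually-critical heads.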

Paper Structure

This paper contains 36 sections, 12 equations, 10 figures, and 12 tables.

Figures (10)

  • Figure 1: (A) Examples of medical VQA and attention maps on medical images. In this example of a brain MRI from the SLAKE dataset, the red box denotes the RoI of the brain tumor that LLaVA-Med should focus on. Red text and green text indicate wrong and correct answers, respectively. (B) Example of ground truth RoIs for different prompts on an abdomen CT from SLAKE.
  • Figure 2: (A) The overview of A$^3$Tune and (B) the details of the designed visual attention alignment tuning.
  • Figure 3: Motivation for using A$^3$MoE. The second column shows prompt-aware weak labels, with red bounding boxes and green inner segments. The third column shows the attention maps generated using shared parameters for the Query and Key matrices.
  • Figure 4: Effectiveness analysis of RoI labels. Base is the base model, Control adds ControlMLLM to align attention maps with ground truth labels, Weak uses weak labels, and GT uses ground truth labels.
  • Figure 5: Analysis of (a) the number of selected attention heads $R$ in A$^3$Tune and (b) the number of weak labels $K$, evaluated on the SLAKE dataset.
  • ...and 5 more figures