Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

Aofei Chang; Le Huang; Alex James Boyd; Parminder Bhatia; Taha Kass-Hout; Cao Xiao; Fenglong Ma

TL;DR

This work tackles hallucination and attention misalignment in medical vision-language models by introducing A$^3$Tune, a training-time framework that automatically aligns attention to diagnostically critical regions. It combines prompt-aware weak labels generated from SAM and refined with BioMedCLIP, selective tuning of visually-critical attention heads, and a novel A$^3$MoE mechanism to adapt parameters across prompts and images. The approach is coupled with a mask-based attention objective and a final objective that balances language modeling with alignment, yielding superior performance on medical VQA and report generation benchmarks and improved visual grounding. The method demonstrates strong generalization across datasets and models, suggesting a practical path toward more reliable and interpretable Med-LVLMs in clinical settings.

Abstract

Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^3$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^3$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^3$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^3$Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
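
To make the idea of a mask-based attention objective combined with a language-modeling objective concrete, the following is a minimal sketch under stated assumptions, not the formulation used in A$^3$Tune. The abstract does not give the loss form, so the negative-log attention-mass loss, the weighting factor `lam`, and the function names `attention_alignment_loss` and `combined_objective` are all hypothetical and chosen only for illustration.

```python
import math


def attention_alignment_loss(attn, mask, eps=1e-8):
    """Negative log of the attention mass that falls inside the mask.

    Assumed loss form (not taken from the paper).
    attn: non-negative attention weights over image patches for one head.
    mask: 0/1 weak label per patch, 1 marking a prompt-relevant patch.
    """
    if len(attn) != len(mask):
        raise ValueError("attn and mask must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must have positive mass")
    # Fraction of the head's attention that lands on masked patches.
    inside = sum(a for a, m in zip(attn, mask) if m) / total
    return -math.log(inside + eps)


def combined_objective(lm_loss, align_losses, lam=0.5):
    """Language-modeling loss plus a weighted mean alignment loss.

    align_losses holds one value per selected attention head; lam is a
    hypothetical balancing weight.
    """
    if not align_losses:
        return lm_loss
    return lm_loss + lam * sum(align_losses) / len(align_losses)


# Usage: a head focused on the masked patches is penalized less than a diffuse one.
mask = [0, 1, 1, 0]
focused = attention_alignment_loss([0.05, 0.45, 0.45, 0.05], mask)
diffuse = attention_alignment_loss([0.40, 0.10, 0.10, 0.40], mask)
total = combined_objective(lm_loss=2.0, align_losses=[focused, diffuse], lam=0.5)
```

In this sketch only the heads passed in `align_losses` contribute to the alignment term, which mirrors the abstract's point about tuning selected heads while leaving the rest of the model driven by the language-modeling loss.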
