Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
TL;DR
The paper identifies function words as a source of distraction that harms cross-modal alignment in vision-language models under adversarial attacks. It introduces Function-word De-Attention (FDA), a plug-in mechanism that runs a parallel function-word cross-attention path and subtracts its influence from the original attention, yielding more robust representations. Extensive experiments across retrieval and visual grounding tasks on multiple models and datasets show that FDA achieves substantial robustness gains with minimal or even positive impact on clean accuracy, and that its benefits scale with backbone size. Ablation studies, zero-shot analysis, and visualizations support FDA's effectiveness and generalization. The paper also notes limitations and directions for extending FDA to other backbone architectures.
Abstract
To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Similar to a differential amplifier, FDA calculates both the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former to obtain more aligned and robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
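To make the differential subtraction concrete, the following is a minimal NumPy sketch of one attention head, not the authors' implementation. It rests on several assumptions that the abstract does not confirm: text tokens act as keys and values (the paper may place them on the query side instead), the function-word path is obtained by masking the attention scores to function-word positions and renormalizing, and the subtraction is scaled by a hypothetical coefficient `lam`. The names `fda_weights`, `fda_cross_attention`, and `function_mask` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def fda_weights(q, k, function_mask, lam=0.5):
    """Attention weights after subtracting the function-word path (assumed form).

    q: (n_query, d) queries; k: (n_text, d) text-token keys.
    function_mask: (n_text,) bool, True where a token is a function word.
    lam: hypothetical subtraction strength, not taken from the paper.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores)                      # original cross-attention
    if not function_mask.any():                 # no function words: nothing to subtract
        return attn
    # Function-word path: keep only function-word positions, then renormalize.
    fw_scores = np.where(function_mask[None, :], scores, -np.inf)
    fw_attn = softmax(fw_scores)
    return attn - lam * fw_attn                 # differential subtraction

def fda_cross_attention(q, k, v, function_mask, lam=0.5):
    # Apply the de-attended weights to the values.
    return fda_weights(q, k, function_mask, lam) @ v

# Toy example: 4 queries, 5 text tokens, tokens 0 and 2 marked as function words.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))
function_mask = np.array([True, False, True, False, False])
out = fda_cross_attention(q, k, v, function_mask, lam=0.5)
```

In this sketch the weights on content tokens are left unchanged, while the weights on function-word tokens are reduced and can become negative, so the rows no longer sum to one. Whether the actual method renormalizes, clips, or learns the subtraction strength is not stated in the abstract.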
