Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen

TL;DR

The paper identifies function words as a source of distraction that harms cross-modal alignment in vision-language models under adversarial attacks. It introduces Function-word De-Attention (FDA), a plug-in mechanism that runs a parallel function-word cross-attention path and subtracts its influence from the original attention, yielding more robust representations. Extensive experiments across retrieval and visual grounding tasks on multiple models and datasets show FDA achieves substantial robustness gains with minimal or even positive impact on clean accuracy, and its benefits scale with backbone size. Ablation studies, zero-shot analysis, and visualization support FDA's effectiveness and generalization, while noting limitations and directions for extending to other backbone architectures.

Abstract

To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Similar to a differential amplifier, FDA computes the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former to obtain more aligned and robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
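The differential subtraction described in the abstract can be illustrated with a minimal NumPy sketch. This is one possible reading, not the authors' implementation: it assumes text tokens act as queries over image tokens, that the function-word path is summarized by the mean attention of the function-word tokens, that a scalar `gate` in [0, 1) scales that path, and that the result is clipped and renormalized. The names `de_attention`, `is_function_word`, and `gate` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def de_attention(text_q, image_k, image_v, is_function_word, gate=0.5):
    """Single-head cross-attention with a subtracted function-word path.

    text_q:           (T, d) text queries
    image_k, image_v: (N, d) image keys and values
    is_function_word: (T,) boolean mask over text tokens (assumed given)
    gate:             scalar in [0, 1); an assumed stand-in for the control gate

    Returns (output, weights) with shapes (T, d) and (T, N).
    """
    is_function_word = np.asarray(is_function_word, dtype=bool)
    d = text_q.shape[-1]

    # Original path: every text token attends over the image tokens.
    scores = text_q @ image_k.T / np.sqrt(d)           # (T, N)
    attn = softmax(scores)

    # Function-word path (assumption): average attention of function-word tokens.
    if is_function_word.any():
        fw_attn = attn[is_function_word].mean(axis=0, keepdims=True)  # (1, N)
    else:
        fw_attn = np.zeros((1, image_k.shape[0]))

    # Differential subtraction, then clip and renormalize (both assumptions).
    # Each row of `attn` sums to 1 and `gate * fw_attn` sums to `gate` < 1,
    # so the clipped row sum is at least 1 - gate and never zero.
    weights = np.clip(attn - gate * fw_attn, 0.0, None)
    weights = weights / np.maximum(weights.sum(axis=-1, keepdims=True), 1e-12)

    return weights @ image_v, weights
```

Setting `gate=0.0` recovers standard softmax cross-attention, which makes the sketch easy to compare against an unmodified attention head.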

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, 26 tables.

Figures (4)

  • Figure 1: Grad-CAM of attention maps of VLM under white-box untargeted attacks through perturbed images. The texts are given at the bottom of the figure, with function words highlighted. Left: The VLM correctly recognizes the female student on the clean image given the token her. Mid: The VLM is distracted by the adversarial perturbation and partially looks at the male coach. Right: The distraction is mitigated by simply applying masks to remove all function words: the VLM successfully 'looks back at' the female student.
  • Figure 2: Left: An illustration of our Function-word De-Attention (FDA) method. On top of the existing attention calculation, which uses $\mathcal{F}_V$ and $\mathcal{F}_T$, we add a parallel pipeline that calculates the attention between the function words $\mathcal{F}_{T_f}$ and the images $\mathcal{F}_{V}$. The function-word attention then passes through a control gate $\mathcal{G}$ before entering the FDA module (triangle), where it is differentially subtracted to remove distractions, as presented in Eq. \ref{eq:subtract}. Right: We speculate that attacks can easily cross the misalignment boundary for less aligned models (top), and that by removing function-word distractions, models can learn a robust embedding (bottom), preventing misalignments.
  • Figure 3: Left: t-SNE of the vision-language embeddings of the vanilla VLM, FDA, FARE, and TeCoA. Our FDA is the most aligned model. Right: Comparison of text-image similarity for the vanilla VLM versus VLM + FDA. Our FDA yields better alignment, with larger similarities and smaller variances.
  • Figure 4: A heatmap of attention probabilities given the same image and text inputs. Left: The original attention probabilities are relatively 'noisy' and show several visible stripes with very low probabilities, implying that some less relevant visual tokens are activated with negligible contributions. Mid: Attention probabilities with one FDA subtraction show far fewer of these 'stripes', with much cleaner and more focused attention. However, some distractions remain visible. Right: Attention probabilities with two subtractions give the cleanest attention maps with the fewest distractions, with strong activations (i.e., higher probabilities) only on the most relevant visual tokens.
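The masking experiment in Figure 1 presupposes a way to mark which text tokens are function words. A minimal sketch of such a mask is given below. The word list is a small, hypothetical example (the list used in the paper is not given here), and the tokens are assumed to be whole lowercase-comparable words, whereas real VLM tokenizers typically produce subword pieces.

```python
# Hypothetical, deliberately small function-word list for illustration only.
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "at", "to",
    "and", "or", "is", "are", "with", "by", "for",
})


def function_word_mask(tokens, function_words=FUNCTION_WORDS):
    """Return one boolean per token: True where the token is a function word."""
    return [tok.lower() in function_words for tok in tokens]


def drop_function_words(tokens, function_words=FUNCTION_WORDS):
    """Return the tokens with function words removed, preserving order."""
    mask = function_word_mask(tokens, function_words)
    return [tok for tok, is_fw in zip(tokens, mask) if not is_fw]
```

For example, `drop_function_words(["The", "coach", "is", "talking", "to", "the", "student"])` keeps only the content words `["coach", "talking", "student"]`.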