Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation
Hee-Seon Kim, Minbeom Kim, Changick Kim
TL;DR
We address the vulnerability of Vision-Language Models (VLMs) to adversarial perturbations by introducing Doubly-UAP, a single universal perturbation optimized on the vision encoder so that it degrades model behavior across both image and text inputs. The method targets the vision encoder's attention mechanism, specifically the value vectors in middle-to-late layers. It is optimized in a label-free manner, treats the LLM as a black box, and keeps the perturbation within the budget $\|\delta\|_{\infty} \le \epsilon$. Empirically, Doubly-UAP achieves state-of-the-art attack success rates on classification, image captioning, and visual question answering across multiple VLMs (LLaVA, LLaVA-1.5, InstructBLIP) and vision encoders (CLIP-224/336, EVA-CLIP), outperforming baselines that attack only image or text embeddings. The results show that disrupting vision-encoder representations can drastically degrade both visual understanding and the language generation that depends on it, underscoring the need for defenses that protect cross-modal models against universal perturbations.
Abstract
Large Vision-Language Models (VLMs) have demonstrated remarkable performance across multimodal tasks by integrating vision encoders with large language models (LLMs). However, these models remain vulnerable to adversarial attacks. Among such attacks, Universal Adversarial Perturbations (UAPs) are especially powerful, because a single optimized perturbation can mislead the model across many different input images. In this work, we introduce a novel UAP specifically designed for VLMs: the Doubly-Universal Adversarial Perturbation (Doubly-UAP), which universally deceives VLMs across both image and text inputs. To disrupt the vision encoder's core processing, we analyze the main components of its attention mechanism. After identifying the value vectors in the middle-to-late layers as the most vulnerable, we optimize Doubly-UAP in a label-free manner with a frozen model. Although Doubly-UAP is developed with the LLM treated as a black box, it achieves high attack success rates on VLMs and consistently outperforms baseline methods across vision-language tasks. Extensive ablation studies and analyses further demonstrate the robustness of Doubly-UAP and provide insights into how it influences internal attention mechanisms.
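The general shape of a label-free universal perturbation under an $\ell_\infty$ budget can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a toy stand-in for a value projection (`toy_value_vectors`, a single `tanh` of a linear map), an assumed objective (the squared distance between clean and perturbed value vectors), and illustrative hyperparameters (`eps=8/255`, `step=1/255`, `n_iters=50`), none of which are specified in this section. Clipping the perturbed inputs to the valid pixel range is omitted for brevity.

```python
import numpy as np

def project_linf(delta, eps):
    """Project delta onto the L-infinity ball of radius eps."""
    return np.clip(delta, -eps, eps)

def toy_value_vectors(x, w_v):
    """Hypothetical stand-in for one layer's value projection (not a real ViT)."""
    return np.tanh(x @ w_v)

def distortion(images, delta, w_v):
    """Mean squared distance between clean and perturbed value vectors."""
    diff = toy_value_vectors(images + delta, w_v) - toy_value_vectors(images, w_v)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def optimize_universal_perturbation(images, w_v, eps=8 / 255, step=1 / 255,
                                    n_iters=50, seed=0):
    """Sign-gradient ascent on the assumed distortion objective.

    One delta is shared by every image (universal), no labels are used,
    and the toy encoder weights w_v stay frozen throughout.
    """
    rng = np.random.default_rng(seed)
    delta = rng.uniform(-eps, eps, size=images.shape[1])
    clean = toy_value_vectors(images, w_v)  # frozen reference features
    for _ in range(n_iters):
        adv = toy_value_vectors(images + delta, w_v)
        # Analytic gradient of the squared distance through tanh(x @ w_v).
        dz = 2.0 * (adv - clean) * (1.0 - adv ** 2)
        grad = (dz @ w_v.T).mean(axis=0)  # average over the image batch
        delta = project_linf(delta + step * np.sign(grad), eps)
    return delta

# Toy data: 32 flattened "images" with 48 features and 16-dimensional value vectors.
rng = np.random.default_rng(0)
images = rng.uniform(0.0, 1.0, size=(32, 48))
w_v = rng.normal(0.0, 1.0, size=(48, 16)) / np.sqrt(48)
eps = 8 / 255
delta = optimize_universal_perturbation(images, w_v, eps=eps)
```

The projection step is what enforces $\|\delta\|_{\infty} \le \epsilon$ after every update. In a real setting, the toy projection would be replaced by value vectors read from the chosen layers of a frozen vision encoder, and the gradient would come from automatic differentiation rather than the hand-derived expression used here.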
