Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks
Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
TL;DR
This work investigates typographic attacks on large vision-language models (LVLMs) and introduces Self-Generated Typographic Attacks, which use the models themselves to craft deceptive text. It presents two attack families, Class-Based Attacks and Reasoned Attacks, which exploit visual similarity and language reasoning, respectively. Empirical results show these attacks can reduce LVLM classification performance by up to roughly 60% across multiple models and datasets, with Reasoned Attacks often delivering the strongest effects on models like GPT-4V. The findings highlight a critical vulnerability in LVLMs' reliance on textual cues and language understanding, underscoring the need for defenses and further evaluation across diverse models and domains.
Abstract
Typographic attacks, which add misleading text to images, can deceive large vision-language models (LVLMs). The susceptibility of recent LVLMs like GPT-4V to such attacks is understudied, raising concerns about amplified misinformation in personal assistant applications. Previous attacks use simple strategies, such as random misleading words, which don't fully exploit LVLMs' language reasoning abilities. We introduce an experimental setup for testing typographic attacks on LVLMs and propose two novel self-generated attacks: (1) Class-based attacks, where the model identifies a similar class to deceive itself, and (2) Reasoned attacks, where an advanced LVLM suggests an attack combining a deceiving class and description. Our experiments show these attacks significantly reduce classification performance by up to 60% and are effective across different models, including InstructBLIP and MiniGPT4. Code: https://github.com/mqraitem/Self-Gen-Typo-Attack
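To make the class-based attack concrete, the following is a minimal sketch of its first step, under stated assumptions: the model is asked which class most resembles the true class, and its answer becomes the text to paste onto the image. The names `pick_deceiving_class`, `ask_model`, and `toy_model`, as well as the prompt wording, are hypothetical and are not taken from the authors' released code. `toy_model` stands in for a real LVLM call, and rendering the text onto the image is omitted.

```python
from typing import Callable, Sequence


def pick_deceiving_class(
    true_class: str,
    classes: Sequence[str],
    ask_model: Callable[[str], str],
) -> str:
    """Ask the model which other class is most similar to true_class.

    ask_model is a hypothetical text-in, text-out wrapper around an LVLM.
    """
    others = [c for c in classes if c != true_class]
    if not others:
        raise ValueError("need at least one class other than the true class")

    # Assumed prompt wording; the actual prompts may differ.
    prompt = (
        f"Which of the following classes looks most similar to '{true_class}'? "
        f"Answer with one option only: {', '.join(others)}."
    )
    answer = ask_model(prompt).strip().lower()

    # Accept the reply only if it names a valid alternative class.
    for candidate in others:
        if candidate.lower() == answer:
            return candidate
    # Otherwise fall back to the first alternative.
    return others[0]


def toy_model(prompt: str) -> str:
    """Stand-in for a real LVLM call; always answers 'wolf'."""
    return "wolf"


# The returned class name is the text that would be pasted onto the image.
attack_text = pick_deceiving_class("dog", ["dog", "wolf", "airplane"], toy_model)
print(attack_text)
```

In this sketch the same model that is later evaluated chooses its own distractor, which is what makes the attack self-generated. A reasoned attack would additionally ask the model for a short description supporting the deceiving class.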
