Typographic Attacks in a Multi-Image Setting

Xiaomeng Wang, Zhengyu Zhao, Martha Larson

TL;DR

This work studies typographic attacks on Large Vision-Language Models in a realistic multi-image setting where attack texts must be non-repetitive across an image set. It demonstrates that text-image similarity, computed in the embedding space, is a strong predictor of attack success and motivates two strategy families: text-image similarity–based and attack-text-effectiveness–based approaches, evaluated under non-repeating, one-to-one matching. The study shows that text-image similarity strategies substantially outperform random baselines (e.g., up to a 21% gain on ImageNet with CLIP) and maintain stealth by avoiding repeated attack texts; these results transfer to other LVLMs such as InstructBLIP in greybox settings. The findings highlight embedding-space vulnerabilities in LVLMs and provide a framework for evaluating defenses, while enabling future work on more naturalistic attack texts and broader model coverage.

Abstract

Large Vision-Language Models (LVLMs) are susceptible to typographic attacks, which are misclassifications caused by an attack text that is added to an image. In this paper, we introduce a multi-image setting for studying typographic attacks, broadening the current emphasis of the literature on attacking individual images. Specifically, our focus is on attacking image sets without repeating the attack query. Such non-repeating attacks are stealthier, as they are more likely to evade a gatekeeper than attacks that repeat the same attack text. We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity. Our text-image similarity approach improves attack success rates by 21% over random, non-specific methods on the CLIP model using ImageNet while maintaining stealth in a multi-image scenario. An additional experiment demonstrates transferability, i.e., text-image similarity calculated using CLIP transfers when attacking InstructBLIP.

Paper Structure

This paper contains 26 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: In real-world attack scenarios, an attacker would target a CLIP-based system with a set of images rather than a single image. The use of a repeating text (as in a) more strongly signals a typographic attack than the use of diversified texts (as in b). In this paper, we introduce the multi-image attack setting, which limits attack text repetition, and we show the importance of text-image similarity when choosing an attack text for a given target image.
  • Figure 2: Attack success rates of our 579 attack texts. The attack texts are arranged in ascending order of untargeted attack success rate along the horizontal axis.
  • Figure 3: Attack success rates versus visual image prediction probability for three categories of attack texts: highly, moderately, and minimally effective.
  • Figure 4: Attack success rates versus text-image similarity for three categories of attack texts: highly, moderately, and minimally effective. The attack success rates generally increase as text-image similarity rises.
  • Figure 5: One-to-one matching between the image set and the attack text set. Images are ranked by descending visual image prediction probability (VIPP), and attack texts are ranked by descending text-image similarity. For each image, the attack text is selected from the ranked list, excluding the ground truth label and previously used texts. The text and numeric values under each image give its predicted label and probability, while the bracketed value for each attack text gives its text-image similarity.
  • ...and 3 more figures
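
The one-to-one matching described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simple greedy reading of the caption: images are visited in descending VIPP order, and each image takes the most similar attack text that is neither its ground truth label nor already used. The function names (`cosine_similarity_matrix`, `greedy_one_to_one_match`), the `true_label_idx` convention (`-1` when the label is not among the attack texts), and the toy numbers are all hypothetical. In practice the embeddings would come from a model such as CLIP; here the similarity matrix is filled in by hand.

```python
import numpy as np


def cosine_similarity_matrix(image_emb, text_emb):
    """Return an (n_images, n_texts) matrix of cosine similarities."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return image_emb @ text_emb.T


def greedy_one_to_one_match(vipp, similarity, true_label_idx):
    """Assign each image a distinct attack text (assumed greedy scheme).

    vipp:           (n_images,) prediction probability of each clean image
    similarity:     (n_images, n_texts) text-image similarity scores
    true_label_idx: (n_images,) index of the attack text equal to the image's
                    ground truth label, or -1 if there is none
    Returns a dict mapping image index -> attack text index.
    """
    vipp = np.asarray(vipp, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    true_label_idx = np.asarray(true_label_idx, dtype=int)

    n_images, n_texts = similarity.shape
    used = np.zeros(n_texts, dtype=bool)
    assignment = {}

    # Visit images in descending VIPP order.
    for img in np.argsort(-vipp, kind="stable"):
        scores = similarity[img].copy()
        scores[used] = -np.inf                      # no repeated attack texts
        if true_label_idx[img] >= 0:
            scores[true_label_idx[img]] = -np.inf   # never use the true label
        best = int(np.argmax(scores))
        if not np.isfinite(scores[best]):
            raise ValueError(f"no attack text left for image {int(img)}")
        used[best] = True
        assignment[int(img)] = best
    return assignment


if __name__ == "__main__":
    # Toy example: 3 images, 4 candidate attack texts (values are made up).
    vipp = [0.9, 0.5, 0.7]
    similarity = [
        [0.8, 0.6, 0.1, 0.3],
        [0.7, 0.2, 0.4, 0.1],
        [0.9, 0.5, 0.3, 0.2],
    ]
    true_label_idx = [0, -1, -1]  # text 0 is the true label of image 0
    print(greedy_one_to_one_match(vipp, similarity, true_label_idx))
```

A greedy pass is only one way to satisfy the non-repetition constraint. A globally optimal assignment (for example, via the Hungarian algorithm) would be another option, but the source does not say which procedure the authors use.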