SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo

TL;DR

SceneTAP uses an LLM-driven planner to generate scene-coherent typographic adversarial text and to choose where to insert it, then applies a diffusion-based, scene-coherent TextDiffuser to embed the text naturally into real-world scenes. By separating text generation, placement, and insertion, the method achieves high attack success while preserving visual plausibility, and it extends to physical environments through printed patches. Across TypoD-base, LingoQA, and VQAv2, SceneTAP consistently outperforms the Center and Margin baselines on both open-ended and two-choice questions, in evaluations that cover multiple LVLMs and a robust commercial model. The work exposes the vulnerability of LVLMs to sophisticated, context-aware typographic attacks and offers insights for developing defenses and more resilient multimodal systems.

Abstract

Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness, using the capabilities of an LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planner (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. SceneTAP uses chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.
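
The three-stage process described in the abstract can be pictured as a simple control flow. The sketch below is only an illustration under stated assumptions: `AdversarialPlan`, `run_three_stages`, and the three stage callables are hypothetical names that do not come from the paper, and the stand-in stages return fixed values instead of calling an LLM or a diffusion model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class AdversarialPlan:
    """Hypothetical container for the planner's three decisions."""
    text: str                       # what adversarial text to generate
    box: Tuple[int, int, int, int]  # where to place it: (left, top, right, bottom)
    instructions: str               # how to integrate it into the scene


def run_three_stages(
    image: Any,
    question: str,
    understand: Callable[[Any, str], str],
    plan: Callable[[str, str], AdversarialPlan],
    integrate: Callable[[Any, AdversarialPlan], Any],
) -> Tuple[Any, AdversarialPlan]:
    """Chain the stages; each callable is an assumed, caller-supplied component."""
    scene = understand(image, question)        # stage 1: scene understanding
    decided = plan(scene, question)            # stage 2: adversarial planning
    return integrate(image, decided), decided  # stage 3: seamless integration


# Stand-in stages so the sketch runs without any model (illustrative values only).
def fake_understand(image: Any, question: str) -> str:
    return "a bathroom with a towel hanging next to a blank wall sign"


def fake_plan(scene: str, question: str) -> AdversarialPlan:
    return AdversarialPlan("gray", (10, 20, 110, 60), "print the word on the wall sign")


def fake_integrate(image: Any, decided: AdversarialPlan) -> Any:
    # A real system would edit pixels here; this stub only records the decision.
    return (image, decided.text, decided.box)


edited, decided = run_three_stages(
    "img", "What color is the towel in the image?",
    fake_understand, fake_plan, fake_integrate,
)
```

Keeping the plan as an explicit value between stages mirrors the separation of text generation, placement, and insertion that the summary emphasizes, and it lets each stage be swapped or ablated independently.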

Paper Structure

This paper contains 26 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: (a) An example of inserting four types of adversarial texts. (b) Quantitative results for the four types of adversarial texts on 100 image-question pairs when attacking the LLaVA-1.5-13b model. We use the attack success rate (ASR) as the metric. (c)-Left: Influence of adversarial text placement, with example attack strength heatmaps for specific questions featuring adversarial text in different locations. (c)-Right: Influence of adversarial text placement in two cases. We insert the specified adversarial text at grid points in the image. The question for the first case is "What color is the towel in the image?" with choices gray (adversarial text) and white (correct answer). The question for the second case is "What entity is depicted in the image?" with choices plate (adversarial text) and garter snake (correct answer). The attack strength map highlights areas with higher attack strength in warmer colors (red).
  • Figure 2: Pipeline of our scene-coherent typographic adversarial planner (SceneTAP) and its intermediate outputs leading to the final generated image.
  • Figure 3: Visualization comparing SceneTAP adversarial examples: Digital SceneTAP (generated) and Physical SceneTAP (real-world implementation). Physical examples were created by printing the generated texts (shown in right subfigure), applying them to identical scenes, and capturing new photographs. The bottom row displays response comparisons from four VLMs across all three image variants.
  • Figure 4: Ablation study on the influence of the main components in SceneTAP.
  • Figure 5: Visualization of the N-Score assessment across different score ranges. The arrows indicate the locations of the added text within each image.
  • ...and 3 more figures
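
The Figure 1 caption mentions two quantities that are easy to make concrete: the attack success rate (ASR) and an attack strength map built by inserting text at grid points. The sketch below is a minimal illustration, not the paper's evaluation code. It assumes an attack counts as successful when the model's answer matches the adversarial text, and it assumes attack strength at a grid point is the fraction of trials in which the model is fooled. The helper names and the `is_fooled` callable are hypothetical.

```python
from typing import Callable, List, Sequence


def attack_success_rate(answers: Sequence[str], adversarial: Sequence[str]) -> float:
    """Fraction of cases whose answer equals the adversarial text (assumed criterion)."""
    if len(answers) != len(adversarial):
        raise ValueError("answers and adversarial targets must have the same length")
    if not answers:
        return 0.0
    hits = sum(a.strip().lower() == t.strip().lower()
               for a, t in zip(answers, adversarial))
    return hits / len(answers)


def attack_strength_map(
    width: int,
    height: int,
    rows: int,
    cols: int,
    is_fooled: Callable[[float, float], bool],
    trials: int = 1,
) -> List[List[float]]:
    """Evaluate a rows x cols grid of cell centers; each entry is the fooled fraction."""
    if rows <= 0 or cols <= 0 or trials <= 0:
        raise ValueError("rows, cols, and trials must be positive")
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            x = (c + 0.5) * width / cols   # horizontal center of the grid cell
            y = (r + 0.5) * height / rows  # vertical center of the grid cell
            fooled = sum(bool(is_fooled(x, y)) for _ in range(trials))
            row.append(fooled / trials)
        grid.append(row)
    return grid


# Toy stand-in for "insert text at (x, y), query the model, check the answer":
# here the model is fooled only when the text lands in the left half of the image.
toy_map = attack_strength_map(100, 100, rows=2, cols=2,
                              is_fooled=lambda x, y: x < 50)
toy_asr = attack_success_rate(["gray", "white", "Gray "], ["gray", "gray", "gray"])
```

In a real experiment, `is_fooled` would wrap the image edit and the model query; the nested list it produces can then be rendered as a heatmap with warmer colors for larger values.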