SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Yue Cao; Yun Xing; Jie Zhang; Di Lin; Tianwei Zhang; Ivor Tsang; Yang Liu; Qing Guo

TL;DR

SceneTAP uses an LLM-driven planner to generate scene-coherent typographic adversarial text and choose where to insert it, then uses a diffusion-based TextDiffuser to embed the text naturally into real-world scenes. By separating text generation, placement, and insertion, the method achieves high attack success while preserving visual plausibility, and it extends to physical environments through printed patches. Across TypoD-base, LingoQA, and VQAv2, SceneTAP consistently outperforms the Center and Margin baselines on both open-ended and two-choice questions, in evaluations covering multiple LVLMs and a robust commercial model. The work exposes the vulnerability of LVLMs to sophisticated, context-aware typographic attacks and offers insights for developing defenses and more resilient multimodal systems.

Abstract

Large vision-language models (LVLMs) have shown remarkable capabilities in interpreting visual content. While existing works demonstrate these models' vulnerability to deliberately placed adversarial texts, such texts are often easily identifiable as anomalous. In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness, leveraging the capabilities of an LLM-based agent. Our approach addresses three critical questions: what adversarial text to generate, where to place it within the scene, and how to integrate it seamlessly. We propose a training-free, multi-modal LLM-driven scene-coherent typographic adversarial planner (SceneTAP) that employs a three-stage process: scene understanding, adversarial planning, and seamless integration. SceneTAP uses chain-of-thought reasoning to comprehend the scene, formulate effective adversarial text, strategically plan its placement, and provide detailed instructions for natural integration within the image. This is followed by a scene-coherent TextDiffuser that executes the attack using a local diffusion mechanism. We extend our method to real-world scenarios by printing and placing generated patches in physical environments, demonstrating its practical implications. Extensive experiments show that our scene-coherent adversarial text successfully misleads state-of-the-art LVLMs, including ChatGPT-4o, even after capturing new images of physical setups. Our evaluations demonstrate a significant increase in attack success rates while maintaining visual naturalness and contextual appropriateness. This work highlights vulnerabilities in current vision-language models to sophisticated, scene-coherent adversarial attacks and provides insights into potential defense mechanisms.
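The three-stage process described in the abstract (scene understanding, adversarial planning, seamless integration) can be pictured as a simple composition of stages. The sketch below is only an illustration of that control flow under stated assumptions, not the authors' implementation: `AttackPlan`, `scene_tap_sketch`, and the `stub_*` functions are hypothetical names, images are represented as plain strings, and the stubs stand in for the LLM planner and the TextDiffuser, neither of which is called here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical bounding box: (x, y, width, height) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class AttackPlan:
    """Hypothetical container for the planner's three decisions."""
    adversarial_text: str   # what text to generate
    region: BBox            # where to place it
    style_hint: str         # how to integrate it


def scene_tap_sketch(
    image: str,
    question: str,
    understand: Callable[[str, str], str],
    plan: Callable[[str, str], AttackPlan],
    integrate: Callable[[str, AttackPlan], str],
) -> Tuple[str, AttackPlan]:
    """Chain the three stages; each stage is injected as a callable."""
    scene = understand(image, question)        # stage 1: scene understanding
    attack_plan = plan(scene, question)        # stage 2: adversarial planning
    attacked = integrate(image, attack_plan)   # stage 3: seamless integration
    return attacked, attack_plan


# Stand-in stages (assumptions for illustration only). In the paper, stages 1
# and 2 are carried out by a multi-modal LLM with chain-of-thought reasoning,
# and stage 3 by a scene-coherent TextDiffuser using local diffusion.
def stub_understand(image: str, question: str) -> str:
    return f"scene of {image}"


def stub_plan(scene: str, question: str) -> AttackPlan:
    return AttackPlan("BLUE", (10, 20, 80, 30), "match nearby signage")


def stub_integrate(image: str, plan: AttackPlan) -> str:
    return f"{image}+text[{plan.adversarial_text}]@{plan.region}"


if __name__ == "__main__":
    attacked, chosen = scene_tap_sketch(
        "img", "What color is the sign?",
        stub_understand, stub_plan, stub_integrate,
    )
    print(chosen)
    print(attacked)
```

Passing the stages in as callables mirrors the separation of text generation, placement, and insertion: any one stage could be swapped out without touching the other two.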
