Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu

TL;DR

This work examines Typographic Visual Prompt Injection (TVPI) as a security threat to cross-modality generation models, covering both Vision-Language Perception (VLP) and Image-to-Image (I2I) tasks. It introduces the TVPI Dataset to systematically evaluate open- and closed-source Large Vision Language Models (LVLMs) and I2I Generation Models (GMs), analyzing how visual prompts carrying target semantics disrupt outputs and how text size, opacity, and position affect the strength of the attack. The study finds that TVPI can significantly alter model behavior, with large models often showing heightened vulnerability and defenses providing only partial mitigation. The findings highlight practical security risks in real-world cross-vision systems and motivate the development of robust defenses and safer prompt-handling mechanisms.

Abstract

Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision-modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I GMs are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words onto input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Visual prompts, a more sophisticated form of typography, have also been shown to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact of Typographic Visual Prompt Injection (TVPI) on various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate TVPI security risks on open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.

Paper Structure

This paper contains 17 sections, 3 figures, 4 tables, 3 algorithms.

Figures (3)

  • Figure 1: The framework of Typographic Visual Prompt Injection threats to various open-source and closed-source LVLMs and I2I GMs for VLP and I2I tasks. The VLP and I2I tasks comprise 4 and 2 sub-tasks, respectively, each implemented through a different input text prompt. The target visual prompts in the I2I task are Harmful (naked, bloody), Bias (African, Asian), and Neutral (glasses, hat) content.
  • Figure 2: The impact of typographic visual prompt injection and typographic word injection on open-source and closed-source I2I GMs. (left) Original clean images. (middle) Generated images affected by typographic visual prompt injection. (right) Generated images of closed-source I2I GMs affected by typographic word injection.
  • Figure 3: The impact of typographic visual prompt injection and typographic word injection on different targets in VLP tasks (measured by average ASR across the four sub-tasks).
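
The "average ASR across four sub-tasks" reported for Figure 3 can be illustrated with a minimal sketch. This excerpt does not give the paper's exact ASR definition, so the sketch assumes an attack counts as successful when the injected target word appears in the model's output, and that the average is an unweighted mean over sub-tasks. The function names, sub-task keys, and sample outputs are all hypothetical.

```python
from statistics import mean


def attack_success_rate(outputs, target):
    """Fraction of model outputs that mention the injected target word.

    Assumption: success is a case-insensitive substring match. The paper
    may use a different criterion.
    """
    if not outputs:
        raise ValueError("outputs must be non-empty")
    target = target.lower()
    hits = sum(1 for text in outputs if target in text.lower())
    return hits / len(outputs)


def average_asr(outputs_by_subtask, target):
    """Unweighted mean of the per-sub-task ASR values."""
    return mean(
        attack_success_rate(outputs, target)
        for outputs in outputs_by_subtask.values()
    )


# Made-up outputs for illustration only; not data from the paper.
example = {
    "subtask_a": ["a man wearing glasses", "a man"],
    "subtask_b": ["glasses", "none"],
    "subtask_c": ["two people", "one person"],
    "subtask_d": ["Glasses on a table", "glasses"],
}
avg = average_asr(example, "glasses")
print(f"average ASR: {avg:.2f}")
```

Substring matching is the simplest possible criterion and over-counts short targets (for example, "hat" also matches "that"), so a word-boundary match or a judge model may be more appropriate in practice.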