Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Hao Cheng; Erjia Xiao; Yichi Wang; Lingfeng Zhang; Qiang Zhang; Jiahang Cao; Kaidi Xu; Mengshu Sun; Xiaoshuai Hao; Jindong Gu; Renjing Xu

TL;DR

This work examines Typographic Visual Prompt Injection (TVPI) as a security threat to cross-modality generation models, spanning Vision-Language Perception and Image-to-Image tasks. It introduces the TVPI Dataset to systematically evaluate TVPI across open- and closed-source LVLMs and I2I GMs, analyzing how visual prompts with target semantics disrupt outputs and how factors such as text size, opacity, and position affect the strength of the attack. The study reveals that TVPI can significantly alter model behavior, with large models often showing heightened vulnerability and defenses providing only partial mitigation. The findings highlight practical security risks in real-world cross-vision systems and motivate the development of robust defenses and safer prompt-handling mechanisms.

Abstract

Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I GMs are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images strongly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Visual prompts, a more sophisticated form of typography, have also been shown to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, we comprehensively investigate the performance impact of Typographic Visual Prompt Injection (TVPI) on various LVLMs and I2I GMs. We propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate TVPI security risks on open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.
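
To make the factors named in the TL;DR (text size, opacity, position) concrete, the following is a minimal sketch of how a typographic overlay could be blended into an image. It is not the authors' pipeline: the function names `scale_mask` and `overlay_text`, the grayscale list-of-lists image representation, and the binary glyph mask standing in for rendered text are all assumptions chosen to keep the example dependency-free.

```python
from typing import List

Image = List[List[float]]  # grayscale pixels in [0, 1], row-major (assumed format)
Mask = List[List[int]]     # 1 where the glyph has ink, 0 elsewhere (hypothetical)


def scale_mask(mask: Mask, factor: int) -> Mask:
    """Nearest-neighbour upscaling of a glyph mask; stands in for text size."""
    scaled: Mask = []
    for row in mask:
        wide = [value for value in row for _ in range(factor)]
        scaled.extend(list(wide) for _ in range(factor))
    return scaled


def overlay_text(image: Image, mask: Mask, top: int, left: int,
                 opacity: float, size: int = 1, ink: float = 1.0) -> Image:
    """Alpha-blend a glyph mask onto a copy of the image.

    (top, left) controls position, `size` the scale of the glyph, and
    `opacity` the blend weight: out = (1 - opacity) * pixel + opacity * ink.
    Pixels falling outside the image are clipped.
    """
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must lie in [0, 1]")
    glyph = scale_mask(mask, size)
    out = [list(row) for row in image]  # leave the input untouched
    for dy, mask_row in enumerate(glyph):
        for dx, on in enumerate(mask_row):
            y, x = top + dy, left + dx
            if on and 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = (1 - opacity) * out[y][x] + opacity * ink
    return out
```

In practice the mask would come from a font renderer and the image from a real dataset; the sketch only isolates the three knobs so that their effect on the perturbed input is easy to see.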
