Not Just Text: Uncovering Vision Modality Typographic Threats in Image Generation Models

Hao Cheng, Erjia Xiao, Jiayan Yang, Jiahang Cao, Qiang Zhang, Jize Zhang, Kaidi Xu, Jindong Gu, Renjing Xu

TL;DR

The paper addresses the risk that image generation models can be coerced into producing inappropriate or rights-violating content through manipulation of the vision modality. It uses typographic attacks to expose vulnerabilities in the vision modality and evaluates existing defenses against such threats, finding them ineffective. To standardize evaluation, it proposes the Vision Modal Threats in Image Generation Models (VMT-IGMs) dataset as a baseline resource. Overall, the work highlights the need for defenses that cover vision-based threats and provides a benchmark for future research.

Abstract

Current image generation models can effortlessly produce high-quality, highly realistic images, but this also increases the risk of misuse. In various Text-to-Image or Image-to-Image tasks, attackers can generate images containing inappropriate content simply by editing the language modality input. To mitigate this security concern, numerous guarding or defensive strategies have been proposed, with a particular emphasis on safeguarding the language modality. However, in practical applications, threats in the vision modality, particularly in tasks that edit real-world images, pose heightened security risks because they can easily infringe upon the rights of the image owner. Therefore, this paper employs a method called typographic attack to reveal that various image generation models are also susceptible to threats within the vision modality. We further evaluate the defense performance of various existing methods against threats in the vision modality and find them ineffective. Finally, we propose the Vision Modal Threats in Image Generation Models (VMT-IGMs) dataset, which serves as a baseline for evaluating the vision modality vulnerability of various image generation models.

Paper Structure

This paper contains 2 sections, 2 equations, 2 figures.

Table of Contents

  1. Introduction
  2. Mathematics

Figures (2)

  • Figure 1: Example of caption. It is set in Roman so that mathematics (always set in Roman: $B \sin A = A \sin B$) may be included without an ugly clash.
  • Figure 2: Example of a short caption, which should be centered.