Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model

Hao Cheng, Erjia Xiao, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, Renjing Xu

TL;DR

This work verifies typographic attacks on well-known commercial and open-source LVLMs and shows that the threat is widespread. It also proposes the most comprehensive and largest-scale Typographic Dataset to date, and the insights drawn from it can reduce the performance degradation caused by typographic attacks.

Abstract

Large Vision-Language Models (LVLMs) rely on vision encoders and Large Language Models (LLMs) to exhibit remarkable capabilities on various multi-modal tasks in the joint space of vision and language. However, typographic attacks, which disrupt Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), are also expected to pose a security threat to LVLMs. First, we verify typographic attacks on well-known commercial and open-source LVLMs and uncover the widespread existence of this threat. Second, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The Typographic Dataset not only evaluates typographic attacks across various multi-modal tasks but also measures how their effect varies with the factors used to generate the text. Based on the evaluation results, we investigate why typographic attacks impact VLMs and LVLMs, leading to three insightful discoveries. While further validating these discoveries, we reduce the performance degradation caused by typographic attacks from 42.07% to 13.90%. Code and dataset are available at https://github.com/ChaduCheng/TypoDeceptions
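To put the reported numbers in perspective, the short sketch below works out how much of the degradation is removed. It assumes that both figures are performance drops expressed in percentage points; the function name `degradation_reduction` is hypothetical and is not taken from the authors' code.

```python
def degradation_reduction(before: float, after: float) -> tuple[float, float]:
    """Compare two performance-degradation figures (assumed to be in percentage points).

    Returns the absolute reduction in percentage points and the
    relative reduction as a percentage of the original degradation.
    """
    absolute = before - after
    relative = absolute / before * 100.0
    return absolute, relative


# Figures quoted in the abstract: degradation drops from 42.07% to 13.90%.
absolute, relative = degradation_reduction(42.07, 13.90)
print(f"absolute reduction: {absolute:.2f} points")
print(f"relative reduction: {relative:.1f}%")
```

Under that assumption, the degradation shrinks by about 28 percentage points, or roughly two thirds of its original size.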

Paper Structure

This paper contains 18 sections, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Typographic attacks on GPT-4V, Google Bard, LLaVA-v1.5 and MiniGPT-4.
  • Figure 2: Distractibility of LVLMs by typographic attacks in multi-modal tasks.
  • Figure 3: The illustration of different typographic factors.
  • Figure 4: (a) CLIP zero-shot classification results and LLaVA's response of a typographic image. (b) Grad-CAM of CLIP with various image-matching texts.
  • Figure 5: An informative prompt makes LLaVA focus on visual contents of the image. (a) Chat with LLaVA using a simple prompt and an informative prompt. (b) and (c) are Grad Activation Maps of the image (red areas indicate models' focal areas) and Attention Map of the sequence (light areas indicate tokens with higher levels of attention from LLaVA), respectively, when LLaVA confronts different prompts (the left side for the simple prompt and the right side for the informative prompt).
  • ...and 3 more figures