When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI

Yanhui Li, Qi Zhou, Zhihong Xu, Huizhong Guo, Wenhai Wang, Dongxia Wang

TL;DR

CamHarmTI delivers a targeted benchmark to probe LVLMs’ ability to perceive and interpret camouflaged harmful text embedded in images. The study shows a pronounced human–LVLM perceptual gap, with humans reliably detecting camouflaged cues while LVLMs fail under three camouflage strategies. Through supervised fine-tuning on CamHarmTI, LVLMs achieve substantial gains in camouflaged-text recognition and harmfulness perception, primarily by adjusting early visual encoder layers, without sacrificing overall multimodal performance. The work provides both a diagnostic tool for perceptual gaps and a practical dataset for training more human-aligned visual reasoning in multimodal safety contexts.

Abstract

Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLM ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.

Paper Structure

This paper contains 36 sections, 2 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: The overview of CamHarmTI. It features five violation categories and three camouflaging types, combining harmful texts within image contexts to examine how LVLMs and humans perceive visually concealed content.
  • Figure 2: Dataset generation of CamHarmTI, including Preparation, Image Generation, and Image-text Post Generation.
  • Figure 3: Testing results of LLaVA-1.5-7B and Qwen2.5VL-7B on MM-Vet before and after SFT.
  • Figure 4: Downsampling and Noise Injection Experiment on IllusionText and ShadowText Tasks with CTR (%) of Three LVLMs.
  • Figure 5: Grad-CAM for Qwen2.5-VL-7B and LLaVA-1.5-7B on Comp Text, before and after SFT.
  • ...and 9 more figures