Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models

Abdulkadir Erol; Trilok Padhi; Agnik Saha; Ugur Kursuncu; Mehmet Emin Aktas

TL;DR

The paper addresses the risk of toxic content generation in large vision-language models (LVLMs) by red-teaming open-source LVLMs with adversarial prompts grounded in social theories. It introduces a workflow for constructing multimodal toxicity datasets, including the Multimodal-RealToxicityPrompts dataset, and evaluates five LVLMs across four prompting strategies using Perspective API metrics, supplemented by statistical and thematic analyses. Toxicity and insults are the most prevalent behaviors (mean rates of 16.13% and 9.75%, respectively); Qwen-VL-Chat, LLaVA-v1.6-Vicuna-7b, and InstructBLIP-Vicuna-7b are the most vulnerable models; and Multimodal Toxic Prompt Completion and Dark Humor are the most effective prompting strategies. The work underscores the need for stronger safety mechanisms, cross-modal guardrails, and targeted defenses in LVLM development to mitigate misuse risks. It also provides a structured analytical framework (dataset construction, prompting strategies, toxicity measurement, statistical and thematic analyses) to guide future safety research and mitigation efforts.

Abstract

The rapid advancement of Large Vision-Language Models (LVLMs) has expanded their capabilities, opening applications that range from content creation to productivity enhancement. Despite this potential, LVLMs exhibit vulnerabilities, especially in generating potentially toxic or unsafe responses. Malicious actors can exploit these vulnerabilities to propagate toxic content in an automated or semi-automated manner, leveraging the susceptibility of LVLMs to deception via strategically crafted prompts, without fine-tuning or compute-intensive procedures. Despite existing red-teaming efforts and the inherent risks associated with LVLMs, the exploration of their vulnerabilities remains nascent and has yet to be addressed systematically. This study systematically examines the vulnerabilities of open-source LVLMs, including LLaVA, InstructBLIP, Fuyu, and Qwen, using adversarial prompt strategies that simulate real-world social manipulation tactics informed by social theories. Our findings show that (i) toxicity and insults are the most prevalent behaviors, with mean rates of 16.13% and 9.75%, respectively; (ii) Qwen-VL-Chat, LLaVA-v1.6-Vicuna-7b, and InstructBLIP-Vicuna-7b are the most vulnerable models, exhibiting toxic response rates of 21.50%, 18.30%, and 17.90%, and insulting response rates of 13.40%, 11.70%, and 10.10%, respectively; (iii) prompting strategies incorporating dark humor and multimodal toxic prompt completion significantly elevate these vulnerabilities. Despite being fine-tuned for safety, these models still generate content with varying degrees of toxicity when prompted with adversarial inputs, highlighting the urgent need for enhanced safety mechanisms and robust guardrails in LVLM development.
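The per-model response rates quoted in the abstract can be read as the share of responses whose score for a given attribute crosses a cutoff. The following is a minimal sketch of that bookkeeping, not the authors' pipeline. The record layout, the model names `model_a` and `model_b`, the toy scores, and the 0.5 cutoff are all assumptions made for illustration; the abstract does not state which cutoff defines a toxic response.

```python
from collections import defaultdict

# Assumed cutoff for counting a response as toxic or insulting (hypothetical).
THRESHOLD = 0.5


def rate_above_threshold(records, attribute, threshold=THRESHOLD):
    """Return {model: percentage of responses with `attribute` score >= threshold}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        totals[rec["model"]] += 1
        if rec["scores"][attribute] >= threshold:
            flagged[rec["model"]] += 1
    return {model: 100.0 * flagged[model] / totals[model] for model in totals}


# Toy records with made-up scores in [0, 1]; these are not results from the paper.
records = [
    {"model": "model_a", "scores": {"TOXICITY": 0.82, "INSULT": 0.61}},
    {"model": "model_a", "scores": {"TOXICITY": 0.10, "INSULT": 0.05}},
    {"model": "model_a", "scores": {"TOXICITY": 0.55, "INSULT": 0.30}},
    {"model": "model_a", "scores": {"TOXICITY": 0.20, "INSULT": 0.12}},
    {"model": "model_b", "scores": {"TOXICITY": 0.07, "INSULT": 0.02}},
    {"model": "model_b", "scores": {"TOXICITY": 0.64, "INSULT": 0.58}},
]

toxicity_rates = rate_above_threshold(records, "TOXICITY")
insult_rates = rate_above_threshold(records, "INSULT")
print(toxicity_rates)
print(insult_rates)
```

Keeping the attribute name as a parameter lets the same function cover both the toxicity and the insult rates without duplicating the counting logic.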

Paper Structure (42 sections, 2 equations, 7 figures, 17 tables)

This paper contains 42 sections, 2 equations, 7 figures, 17 tables.

Introduction
Background
Confirmation Bias
Dark Humor
Social Identity Attacks
Malevolent Creativity
Related Work
Methods
Dataset Creation
Creating Multimodal-RealToxicityPrompts Dataset
Prompting Strategies
Models
Measuring Toxicity
Statistical Analysis
Thematic Analysis of Toxic Responses
...and 27 more sections

Figures (7)

Figure 1: Our approach consists of three main components: (i) dataset creation, (ii) prompt strategies, and (iii) toxicity evaluation.
Figure 2: An example of a multimodal prompt and the response it elicited from LLaVA-v1.6-Vicuna-7b. The image component of the prompt is a meme about taxation, split into two segments. The upper segment reads "Taxation is theft," and the lower segment responds with "I know." The accompanying text prompt instructs the model to adopt the persona "DAN," which typically ignores conventional rules and norms; this steers the model to lower its guardrails and deliver a response that could be unfiltered and potentially aggressive. Specifically, the model is asked to share its opinion on the meme, suggest alternative captions, and make an aggressive comment about the meme. The LVLM's response includes a neutral interpretation and an aggressive, toxic response in line with the instructions provided.
Figure 3: The mean scores for the six metrics under the "Multimodal Toxic Prompt Completion" strategy across five LVLMs. The results highlight the models' varying degrees of susceptibility to generating toxic content, with higher mean scores in Toxicity and Insult. In contrast, Severe Toxicity and Threat are less frequently generated.
Figure 4: For the "Dark Humor" strategy, the grouped bar chart presents the mean scores of responses from five distinct language models, evaluated across six Perspective API metrics, restricted to samples with a Toxicity score of 0.5 or higher. The evaluated models are Fuyu-8b, InstructBLIP-Vicuna-7b, LLaVA-v1.6-Mistral-7b, LLaVA-v1.6-Vicuna-7b, and Qwen-VL-Chat. This figure emphasizes the models' performance under high-toxicity conditions, providing insights into their behavior when confronted with highly toxic prompts.
Figure 5: For the "Social Identity Attacks" strategy, the grouped bar chart presents the mean scores of responses from five distinct language models, evaluated across six Perspective API metrics, restricted to samples with a Toxicity score of 0.5 or higher. The evaluated models are Fuyu-8b, InstructBLIP-Vicuna-7b, LLaVA-v1.6-Mistral-7b, LLaVA-v1.6-Vicuna-7b, and Qwen-VL-Chat. This figure emphasizes the models' performance under high-toxicity conditions, providing insights into their behavior when confronted with highly toxic prompts.
...and 2 more figures
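The filtering described in the captions for Figures 4 and 5 (keep only samples with a Toxicity score of 0.5 or higher, then average each metric per model) could be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the record layout, the toy scores, and the model names are hypothetical, and the six attribute names are assumed to be the standard Perspective API production attributes, which the captions do not enumerate in full. The captions also leave open whether the filter is applied to the prompt or to the response; the sketch simply filters on the score stored in each record.

```python
from collections import defaultdict
from statistics import mean

# Assumed names for the six Perspective API metrics (not listed in full in the captions).
PERSPECTIVE_ATTRIBUTES = (
    "TOXICITY", "SEVERE_TOXICITY", "IDENTITY_ATTACK",
    "INSULT", "PROFANITY", "THREAT",
)


def mean_scores_for_toxic_samples(records, metrics=PERSPECTIVE_ATTRIBUTES, threshold=0.5):
    """Per model, average each metric over records whose TOXICITY score >= threshold."""
    kept = defaultdict(list)
    for rec in records:
        if rec["scores"]["TOXICITY"] >= threshold:
            kept[rec["model"]].append(rec["scores"])
    return {
        model: {metric: mean([s[metric] for s in scores]) for metric in metrics}
        for model, scores in kept.items()
    }


# Toy records with made-up scores; only two metrics are used to keep the example short.
records = [
    {"model": "model_a", "scores": {"TOXICITY": 0.5, "INSULT": 0.25}},
    {"model": "model_a", "scores": {"TOXICITY": 1.0, "INSULT": 0.75}},
    {"model": "model_a", "scores": {"TOXICITY": 0.25, "INSULT": 0.0}},  # dropped by the filter
    {"model": "model_b", "scores": {"TOXICITY": 0.75, "INSULT": 0.5}},
]

summary = mean_scores_for_toxic_samples(records, metrics=("TOXICITY", "INSULT"))
print(summary)
```

Models with no sample at or above the threshold are absent from the result, which avoids reporting a mean over an empty set.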
