ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs

Shiyao Cui, Qinglin Zhang, Xuan Ouyang, Renmiao Chen, Zhexin Zhang, Yida Lu, Hongning Wang, Han Qiu, Minlie Huang

TL;DR

This work addresses multimodal implicit toxicity, where text and image each appear safe in isolation but jointly convey harm. It introduces a taxonomy of cross-modal correlations, builds the MMIT-dataset with 2,100 instances across 7 risk categories, and proposes ShieldVLM, a moderator that uses deliberative reasoning over text-image content to detect both explicit and implicit toxicity. ShieldVLM is trained with reasoning outputs to provide explainable safety assessments and outperforms existing moderation APIs and LVLM baselines on both in-domain and out-of-distribution data. The dataset and model are released to support future research and safer deployment of multimodal content systems.

Abstract

Toxicity detection in multimodal text-image content faces growing challenges, especially with multimodal implicit toxicity, where each modality appears benign on its own but conveys harm when combined. Multimodal implicit toxicity appears not only as formal statements on social platforms but also as prompts that can lead to toxic dialogs from Large Vision-Language Models (LVLMs). Despite the success of unimodal text or image moderation, toxicity detection for multimodal content, particularly multimodal implicit toxicity, remains underexplored. To fill this gap, we build a comprehensive taxonomy for multimodal implicit toxicity (MMIT) and introduce the MMIT-dataset, comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model that identifies implicit toxicity in multimodal statements, prompts, and dialogs via deliberative cross-modal reasoning. Experiments show that ShieldVLM outperforms existing strong baselines in detecting both implicit and explicit toxicity. The model and dataset will be publicly available to support future research. Warning: This paper contains potentially sensitive content.

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Examples of multimodal implicit toxicity in the form of a multimodal statement, prompt, and dialog.
  • Figure 2: Performance gap of representative moderation APIs/models in detecting explicit versus implicit toxicity.
  • Figure 3: Illustration of the cross-modal correlation modes.
  • Figure 4: Illustration of the format, reasoning process, and construction of ShieldVLM.
  • Figure 5: Model performance across correlation modes.
  • ...and 2 more figures