MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models

Bohan Jin, Shuhan Qi, Kehai Chen, Xinyi Guo, Xuan Wang

TL;DR

This work defines dual-implicit toxicity, a cross-modal prejudice signal that surfaces only when text and image are considered together. It introduces the MDIT-Dataset (112,873 toxic questions) and MDIT-Bench (317,638 test items across 12 categories, 23 subcategories, and 780 topics), both built with a Multi-stage Human-in-loop In-context Generation pipeline. MDIT-Bench has three difficulty levels (easy, medium, and hard), the last of which incorporates Long-Context Jailbreaking, and it introduces a Hidden Toxicity Metric to quantify toxicity that can be activated under long-context prompts. Evaluations of 13 prominent LMMs reveal limited sensitivity to dual-implicit toxicity, and the hard-level results expose substantial hidden toxicity that can be triggered under certain conditions. The work provides a large-scale multimodal safety benchmark and dataset to advance research on reducing cross-modal prejudice and discrimination in LMMs and to guide future detoxification strategies.

Abstract

The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention paid to more implicit forms of toxicity involving prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by a model across them. In our experiments, we evaluate 13 prominent LMMs on MDIT-Bench, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. The models' performance drops significantly at the hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. Data are available at https://github.com/nuo1nuo/MDIT-Bench.

Paper Structure

This paper contains 55 sections, 2 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Three types of toxicity: (a) Explicit toxicity: contains directly offensive language; (b) Single-implicit toxicity: contains no obviously offensive language, and the toxicity can be detected from either modality alone; (c) Dual-implicit toxicity: contains no obviously offensive language, and the toxicity can be detected only by combining both modalities.
  • Figure 2: Toxicity categories of MDIT-Dataset. MDIT-Dataset is divided into 12 categories and 23 sub-categories, and the number of samples in each sub-category is approximately equal.
  • Figure 3: MDIT-Bench construction process: (1) Question Generation: Toxic questions and corresponding pseudo-multimodal questions are generated by the LLM, guided by artificially constructed demonstrations. (2) Data Cleaning: Questions are filtered based on the distribution of the Replaced Word. (3) Modal Expansion: Images are collected for the toxic questions using the Replaced Word, transitioning from pseudo-multimodal to fully multimodal. (4) Benchmark Construction: Five answer options are provided for each question to construct the MDIT-Bench.
  • Figure 4: The distribution of the selected options at the medium level. Ans2 and Ans3 are the most frequently incorrectly selected options, indicating that the dual-implicit toxicity is tricky for LMMs. Ans1 to Ans5 are the five multiple-choice options, while "No ans" means that the model does not provide any answer.
  • Figure 5: The accuracy of each category at medium level. The detection difficulty across different categories varies and certain categories require further attention.
  • ...and 11 more figures
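
The statistics summarized in the captions of Figures 4 and 5 (how often each of the five options is selected, and accuracy per category) can be tallied with a few lines of Python. This is only a minimal sketch under stated assumptions: the record layout (`category`, `selected`, `correct`), the function names, and the category labels in the usage note are hypothetical and are not taken from the MDIT-Bench release. The only details drawn from the source are the five option labels Ans1 to Ans5 and the "No ans" bucket for responses that contain no answer.

```python
from collections import Counter, defaultdict

# Option labels as named in the Figure 4 caption.
OPTIONS = ["Ans1", "Ans2", "Ans3", "Ans4", "Ans5"]
NO_ANSWER = "No ans"


def option_distribution(records):
    """Return the fraction of responses per option.

    Any `selected` value that is not one of the five options
    (e.g. None) is counted under "No ans".
    The record format is an assumption made for this sketch.
    """
    labels = OPTIONS + [NO_ANSWER]
    if not records:
        return {label: 0.0 for label in labels}
    counts = Counter(
        r["selected"] if r["selected"] in OPTIONS else NO_ANSWER
        for r in records
    )
    total = len(records)
    return {label: counts[label] / total for label in labels}


def accuracy_by_category(records):
    """Return accuracy per category: correct selections / questions asked."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if r["selected"] == r["correct"]:
            hits[r["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

For example, given a list of per-question records such as `{"category": "cat_a", "selected": "Ans2", "correct": "Ans1"}` (placeholder values), `option_distribution(records)` gives the kind of per-option breakdown that Figure 4 describes, and `accuracy_by_category(records)` gives the per-category view that Figure 5 describes.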