T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Chen Yeh; You-Ming Chang; Wei-Chen Chiu; Ning Yu

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Chen Yeh, You-Ming Chang, Wei-Chen Chiu, Ning Yu

TL;DR

A comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definition is proposed.

Abstract

To address the risks of encountering inappropriate or harmful content, researchers managed to incorporate several harmful contents datasets with machine learning methods to detect harmful concepts. However, existing harmful datasets are curated by the presence of a narrow range of harmful objects, and only cover real harmful content sources. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. Therefore, we propose a comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definition. We also propose a novel annotation framework by formulating the annotation process as a multi-agent Visual Question Answering (VQA) task, having 3 different VLMs "debate" about whether the given image/video is harmful, and incorporating the in-context learning strategy in the debating process. Therefore, we can ensure that the VLMs consider the context of the given image/video and both sides of the arguments thoroughly before making decisions, further reducing the likelihood of misjudgments in edge cases. Evaluation and experimental results demonstrate that (1) the great alignment between the annotation from our novel annotation framework and those from human, ensuring the reliability of VHD11K; (2) our full-spectrum harmful dataset successfully identifies the inability of existing harmful content detection methods to detect extensive harmful contents and improves the performance of existing harmfulness recognition methods; (3) VHD11K outperforms the baseline dataset, SMID, as evidenced by the superior improvement in harmfulness recognition methods. The complete dataset and code can be found at https://github.com/nctu-eva-lab/VHD11K.

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

TL;DR

Abstract

Paper Structure (36 sections, 6 figures, 4 tables)

This paper contains 36 sections, 6 figures, 4 tables.

Introduction
Related Work
Harmfulness detectors and harmful contents dataset
Vision-language models
VHD11K
Raw Data Collection and Preprocessing
Keywords
Collection of Real Data
Generation of Synthesized Data
Annotation Framework
Components and Process
In-context learning
Results
Experiments
Alignment with Human Annotation
...and 21 more sections

Figures (6)

Figure 1: Overview of the process of curating the whole dataset. "J.", "A." and "N." stand for the roles played by three GPT-4Vs, which are "judge", "affirmative debater", and "negative debater", respectively. "ICL samples" is short for in-context learning samples. Please note that the white rectangle masks serve as censorship, and are not included as inputs.
Figure 2: An example of the debate annotation framework. Please note that the white rectangle masks serve as censorship, and are not included as inputs. For detailed role definitions for each of the three agents, please refer to the appendix.
Figure 3: The number of real and synthesized harmful images/videos in each category. Since some images/videos may cover multiple harmful categories simultaneously, please note that the total number of visual contents in all the categories is slightly higher than that of harmful contents in the whole dataset.
Figure 4: Overview of iteratively selecting in-context learning samples. "J.", "A." and "N." stand for the roles played by three GPT-4Vs, which are "judge", "affirmative debater", and "negative debater", respectively. "ICL samples" is short for in-context learning samples. "TPR" and "TNR" are short for true positive rate and true negative rate respectively.
Figure 5: Examples of in-context learning samples and their corresponding expected responses.
...and 1 more figures

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

TL;DR

Abstract

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Authors

TL;DR

Abstract

Table of Contents

Figures (6)