MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili

Han Wang, Tan Rui Yang, Usman Naseem, Roy Ka-Wei Lee

TL;DR

MultiHateClip tackles hateful video detection in a multilingual, multimodal setting by constructing a dataset of 2,000 English and Chinese short videos from YouTube and Bilibili annotated for hatefulness, offensiveness, and normalcy, with granular segment and modality annotations. The dataset is built using hate lexicons and human annotation, enabling cross-cultural analysis of gender-based hate and modality contributions. Benchmark experiments show that vision-language and multimodal models substantially outperform unimodal baselines, though non-Western content remains challenging due to limited training data and the prevalence of implicit hate. The work provides a foundational resource and empirical insights for developing culturally aware, multimodal hate-detection systems.

Abstract

Hate speech is a pressing issue in modern society, with significant effects both online and offline. Recent research in hate speech detection has primarily centered on text-based media, largely overlooking multimodal content such as videos. Existing studies on hateful video datasets have predominantly focused on English content within a Western context and have been limited to binary labels (hateful or non-hateful), lacking detailed contextual information. This study presents MultiHateClip, a novel multilingual dataset created through hate lexicons and human annotation. It aims to enhance the detection of hateful videos on platforms such as YouTube and Bilibili, including content in both English and Chinese. Comprising 2,000 videos annotated for hatefulness, offensiveness, and normalcy, this dataset provides a cross-cultural perspective on gender-based hate speech. Through a detailed examination of human annotation results, we discuss the differences between Chinese and English hateful videos and underscore the importance of different modalities in hateful and offensive video analysis. Evaluations of state-of-the-art video classification models, such as VLM, GPT-4V and Qwen-VL, on MultiHateClip highlight the existing challenges in accurately distinguishing between hateful and offensive content and the urgent need for models that are both multimodally and culturally nuanced. MultiHateClip represents a foundational advance in enhancing hateful video detection by underscoring the necessity of a multimodal and culturally sensitive approach in combating online hate speech.

Paper Structure (22 sections, 3 figures, 10 tables)

This paper contains 22 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Amplitude of English YouTube videos. Y-axis: Amplitude Indicator, X-axis: Time (sec.)
  • Figure 2: Zero Crossing Rate of English YouTube videos. Y-axis: Zero Crossing Indicator, X-axis: Time (sec.)
  • Figure 3: Framework of the multi-modal model. FC: Fully Connected Layer.