Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu

TL;DR

The paper addresses hateful video detection, where both context and cross-modal semantics determine whether content is hateful. It introduces RAMF, a Reasoning-Aware Multimodal Fusion framework that combines adversarial reasoning with Local-Global Context Fusion (LGCF) and Semantic Cross Attention (SCA) to enable fine-grained multimodal interaction. A structured three-stage adversarial reasoning pipeline prompts a vision-language model for objective descriptions, hate-assumed inferences, and non-hate-assumed inferences, enriching contextual understanding while keeping the inferences grounded in the video content. Experiments on HateMM and MultiHateClip show state-of-the-art performance and robust generalisation.

Abstract

Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose a Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning: a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences. These provide complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
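The three-stage adversarial reasoning process described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `vlm` callable signature, the `AdversarialReasoning` container, and the choice to feed the Stage 1 description into Stages 2 and 3 are all assumptions made for the example. The actual prompts are those shown in the paper's Figures 3 to 5 and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical prompt templates (placeholders, NOT the paper's prompts).
STAGE_PROMPTS = {
    "objective": "Describe objectively what is seen and heard in this video.",
    "hate_assumed": (
        "Assume the video is hateful. Using the description below, "
        "explain what supports that reading.\n{description}"
    ),
    "non_hate_assumed": (
        "Assume the video is not hateful. Using the description below, "
        "explain what supports that reading.\n{description}"
    ),
}


@dataclass
class AdversarialReasoning:
    """Text outputs of the three stages (hypothetical container)."""
    objective: str
    hate_assumed: str
    non_hate_assumed: str


def adversarial_reasoning(
    video_ref: str,
    vlm: Callable[[str, str], str],
) -> AdversarialReasoning:
    """Run the three stages in order.

    `vlm(video_ref, prompt)` stands in for any vision-language model call
    that returns text; its signature is an assumption of this sketch.
    """
    # Stage 1: objective description of the video content.
    description = vlm(video_ref, STAGE_PROMPTS["objective"])
    # Stage 2: inference under the assumption that the video is hateful.
    hate = vlm(
        video_ref,
        STAGE_PROMPTS["hate_assumed"].format(description=description),
    )
    # Stage 3: inference under the opposite assumption.
    non_hate = vlm(
        video_ref,
        STAGE_PROMPTS["non_hate_assumed"].format(description=description),
    )
    return AdversarialReasoning(description, hate, non_hate)
```

In a full system, the three returned texts would be encoded and passed to the fusion modules alongside the other modalities; that step is omitted here because the abstract does not specify it.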

Paper Structure

This paper contains 25 sections, 6 equations, 13 figures, 5 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Left: Two main challenges: fusing multimodal semantic relations and understanding contextual nuance. Right: Standard paradigms vs. our RAMF.
  • Figure 2: The overall architecture of the proposed framework, including the Local-Global Context Fusion (LGCF) module and the Semantic Cross Attention (SCA) mechanism.
  • Figure 3: Prompt used in Stage 1 (Objective Description).
  • Figure 4: Prompt used in Stage 2 (Hate-Assumed Inference).
  • Figure 5: Prompt used in Stage 3 (Non-Hate-Assumed Inference).
  • ...and 8 more figures