Towards Training-free Multimodal Hate Localisation with Large Language Models

Yueming Sun, Long Yang, Jianbo Jiao, Zeyu Fu

TL;DR

This work proposes LELA, the first training-free Large Language Model (LLM) based framework for hate video localization, which combines LLMs with modality-specific captioning to detect and temporally localize hateful content.

Abstract

The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Unlike state-of-the-art models that depend on supervised pipelines, LELA uses LLMs and modality-specific captioning to detect and temporally localize hateful content without training. Our method decomposes a video into five modalities (image, speech, OCR, music, and video context) and uses a multi-stage prompting scheme to compute a fine-grained hate score for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.

Paper Structure

This paper contains 16 sections, 8 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: We propose the first LLM-based framework for video hate localization, which addresses the challenge of interpretability in hate content moderation and locates frame-level hateful content from multimodal video input.
  • Figure 2: Conventional video anomaly detection (a) focuses on visual signals and aims to detect short abnormal snippets within an otherwise normal video, typically producing coarse video-level anomaly scores. Video hate localization (b) jointly leverages multiple modalities (image, audio, and text), attends to high-level semantic cues related to hateful or offensive content, and assigns fine-grained frame-level scores, enabling precise localization of harmful regions within the video.
  • Figure 3: Overview of the LELA framework. Given an input video, LELA decomposes it into five modalities: video snippets, static images, background music, speech, and OCR text. Each modality is processed by a dedicated captioning model to extract frame-aligned textual descriptions. A multi-stage prompting strategy (green) evaluates frame-level modality captions from complementary perspectives (e.g., explicit hate, implied hate, target groups), while a parallel composition matching module (pink) summarizes salient information across time and modalities. The resulting textual representations are fed into an LLM to produce per-modality scores, and the final frame-level hate score is obtained by taking the maximum across modalities, enabling fine-grained localization of hateful content and providing interpretable evidence for moderation decisions.
  • Figure 4: We showcase qualitative results obtained by our framework on four test videos, including two examples from the HateMM dataset (top row) and two from the MultiHateClip dataset (bottom row). For each video, we plot the predicted hate score across frames, as computed by our method. We display selected keyframes along with their most relevant modality caption. Frames predicted as non-hateful are marked with blue bounding boxes, while frames predicted as hateful are marked with red bounding boxes. Ground-truth hateful segments are also highlighted in pink, providing a reference to assess the accuracy and interpretability of our localization results.
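The Figure 3 caption states that the final frame-level hate score is the maximum of the per-modality scores, and Figure 4 marks frames as hateful or non-hateful. The following is a minimal Python sketch of that fusion step, not the authors' implementation. The function names, the modality keys, the example scores, and the 0.5 decision threshold are all assumptions made for illustration; the source does not specify how scores are thresholded into segments.

```python
from typing import Dict, List, Tuple


def fuse_frame_scores(per_modality: Dict[str, List[float]]) -> List[float]:
    """Fuse per-modality scores into one score per frame by taking the maximum.

    `per_modality` maps a modality name to a list holding one score per frame.
    All lists must have the same length.
    """
    lengths = {len(scores) for scores in per_modality.values()}
    if len(lengths) != 1:
        raise ValueError("all modalities must cover the same number of frames")
    # zip(*...) groups the scores of every modality frame by frame.
    return [max(frame) for frame in zip(*per_modality.values())]


def hateful_segments(
    scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges whose score is >= threshold.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a hateful run begins
        elif score < threshold and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))  # run reaches the last frame
    return segments


# Made-up scores for a five-frame video, one list per modality.
example_scores = {
    "video": [0.1, 0.2, 0.7, 0.6, 0.1],
    "image": [0.0, 0.1, 0.3, 0.8, 0.2],
    "music": [0.0, 0.0, 0.0, 0.0, 0.0],
    "speech": [0.2, 0.9, 0.4, 0.1, 0.0],
    "ocr": [0.0, 0.0, 0.0, 0.0, 0.3],
}

fused = fuse_frame_scores(example_scores)
segments = hateful_segments(fused, threshold=0.5)
print(fused)
print(segments)
```

Taking the maximum means a frame is flagged as soon as any single modality signals hate, for example hateful speech over otherwise neutral imagery. The index of the maximum also identifies which modality caption to show as evidence for that frame.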