Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation

Nils A. Herrmann, Tobias Eder, Jingyi He, Georg Groh

Abstract

Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that separates two dimensions, incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities), and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes, and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone: FNR-FPR, the difference between false negative and false positive rates, drops from 0.74 to 0.42 for LLaVA-1.6-Mistral-7B and from 0.54 to 0.28 for Qwen2.5-VL-7B. This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.
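
The FNR-FPR figure quoted in the abstract can be illustrated with a minimal Python sketch. It assumes binary labels where 1 marks harmful content and 0 marks benign content; the function name `error_profile` and the toy labels are hypothetical and are not taken from the paper's code or data.

```python
def error_profile(y_true, y_pred):
    """Return (FNR, FPR, FNR - FPR) for binary labels.

    Assumption: 1 = harmful (positive class), 0 = benign (negative class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both classes must be present in y_true")

    fnr = fn / (fn + tp)  # share of harmful items the model missed
    fpr = fp / (fp + tn)  # share of benign items the model flagged
    return fnr, fpr, fnr - fpr


# Toy example (made-up labels): the model misses 3 of 4 harmful items
# and wrongly flags 1 of 4 benign items.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
fnr, fpr, gap = error_profile(y_true, y_pred)
print(f"FNR={fnr:.2f} FPR={fpr:.2f} FNR-FPR={gap:.2f}")
```

A positive gap means the classifier under-detects harmful content more often than it over-flags benign content. Values closer to zero correspond to the more balanced error profile the abstract describes.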

Paper Structure (42 sections, 6 figures, 17 tables)

This paper contains 42 sections, 6 figures, 17 tables.

Figures (6)

  • Figure 1: Example memes from the Hateful Memes dataset that illustrate the concept of 'benign confounders'. Image above is a compilation of assets, including ©Getty Images.
  • Figure 2: Evaluation pipeline. A meme–text pair is passed to the VLM (e.g. LLaVA-1.6) along with a prompt asking whether the content is "Hateful" or "Neutral". The model generates a binary classification based on both visual and textual input. Image above is a compilation of assets, including ©Getty Images.
  • Figure 3: Validation accuracy for hatefulness prediction across different shares of annotated training data.
  • Figure 4: Frequency of fine-grained incivility (left) and intolerance (right) labels across all annotations before aggregation. A single meme may receive multiple labels.
  • Figure 5: Empirical probability of hatefulness conditioned on each category. Error bars show the Wilson score interval at the 0.05 significance level.
  • ...and 1 more figure
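
The Wilson score interval named in the Figure 5 caption can be computed as in the following sketch. It uses the standard textbook formula with only the Python standard library; the function name `wilson_interval` and the example counts are hypothetical, and the paper may have used a different implementation.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, alpha=0.05):
    """Two-sided Wilson score interval for a binomial proportion.

    successes: number of items in a category labeled hateful (illustrative)
    n:         total number of items in that category
    alpha:     significance level (0.05 gives a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Made-up counts: 5 of 10 memes in some category carry the hateful label.
low, high = wilson_interval(5, 10)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for small categories, which matters when some fine-grained labels are rare.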