Labels or Input? Rethinking Augmentation in Multimodal Hate Detection

Sahajpreet Singh, Kokil Jaidka, Subhayan Mukerjee

TL;DR

This study tackles the challenge of robust multimodal hate detection by comparing two data-centric strategies: prompt optimization and multimodal augmentation. It demonstrates that structured prompts can provide meaningful soft supervision, especially for nuanced, scale-based hatefulness labels, while large multimodal systems can approach fine-tuning performance with category prompts on scaled tasks. Concurrently, a multi-agent counterfactual augmentation pipeline generates 2,479 neutral memes to decorrelate textual toxicity from visual cues, improving robustness and fairness with high-quality human validation. Together, these approaches offer a practical path to deployable, bias-aware multimodal hate-detection systems that reduce reliance on costly large-model inference and improve resilience to dataset artifacts.

Abstract

Online hate remains a significant societal challenge, especially as multimodal content enables subtle, culturally grounded, and implicit forms of harm. Hateful memes embed hostility through text-image interactions and humor, making them difficult for automated systems to interpret. Although recent Vision-Language Models (VLMs) perform well on explicit cases, their deployment is limited by high inference costs and persistent failures on nuanced content. This work examines how far small models can be improved through prompt optimization, fine-tuning, and automated data augmentation. We introduce an end-to-end pipeline that varies prompt structure, label granularity, and training modality, showing that structured prompts and scaled supervision significantly strengthen compact VLMs. We also develop a multimodal augmentation framework that generates counterfactually neutral memes via a coordinated LLM-VLM setup, reducing spurious correlations and improving the detection of implicit hate. Ablation studies quantify the contribution of each component, demonstrating that prompt design, granular labels, and targeted augmentation collectively narrow the gap between small and large models. The results offer a practical path toward more robust and deployable multimodal hate-detection systems without relying on costly large-model inference.

Paper Structure

This paper contains 42 sections, 4 equations, 2 figures, and 9 tables.

Figures (2)

  • Figure 1: Overview of the augmentation pipeline. Red and green boxes indicate hateful and non-hateful content, respectively. In part B, augmented data is created for the green–red pairs: cases where a non-hateful background is combined with a hateful caption, resulting in a hateful interpretation.
  • Figure 2: Quality scores of augmented non-hateful memes.