SAFE-MEME: Structured Reasoning Framework for Robust Hate Speech Detection in Memes

Palash Nandi, Shivam Sharma, Tanmoy Chakraborty

TL;DR

SAFE-MEME tackles robust hate-speech detection in memes by introducing structured multimodal reasoning. It presents two variants: SAFE-MEME-QA, a Q&A-style multimodal chain-of-thought (MM-CoT) approach, and SAFE-MEME-H, a hierarchical description-plus-classification method. The authors created two datasets: MHS for fine-grained hate categories and MHS-Con for confounding, stress-test scenarios. Empirically, SAFE-MEME-QA outperforms the baselines on both datasets, while SAFE-MEME-H leads on MHS but beats only the multimodal baselines on MHS-Con; gains reach up to around 6 percentage points in F1. The authors also analyze robustness and error patterns. The work highlights the potential of structured reasoning to improve multimodal hate-speech detection, while acknowledging dataset biases and ethical considerations.
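The TL;DR reports gains in F1 points, while the abstract reports percentage improvements, and the two readings (absolute versus relative) can differ substantially. A minimal sketch of both quantities, assuming macro-averaged F1 over the three labels explicit, implicit, and benign (the averaging scheme is an assumption; this summary does not state it). `macro_f1` and `gain` are hypothetical helpers, and the toy labels are illustrative, not data from MHS or MHS-Con.

```python
from collections import Counter
from typing import Sequence, Tuple

LABELS = ("explicit", "implicit", "benign")


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gets a false positive
            fn[t] += 1  # the true class gets a false negative
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class absent from both gold and predictions scores 0 here (a choice).
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def gain(new: float, old: float) -> Tuple[float, float]:
    """Absolute gain in percentage points and relative gain in percent."""
    points = 100 * (new - old)
    relative = 100 * (new - old) / old
    return points, relative


# Toy labels for illustration only (not from MHS or MHS-Con).
y_true = ["explicit", "implicit", "benign", "implicit", "benign", "explicit"]
y_pred = ["explicit", "benign", "benign", "implicit", "benign", "implicit"]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.656
print(gain(0.66, 0.60))  # roughly (6.0, 10.0), up to float rounding
```

An absolute gain of 6 points on a 0.60 baseline is a relative gain of 10%, so it matters which of the two a reported figure means.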

Abstract

Memes act as cryptic tools for sharing sensitive ideas, often requiring contextual knowledge to interpret. This makes moderating multimodal memes challenging, as existing works either lack high-quality datasets on nuanced hate categories or rely on low-quality social media visuals. Here, we curate two novel multimodal hate speech datasets, MHS and MHS-Con, that capture fine-grained hateful abstractions in regular and confounding scenarios, respectively. We benchmark several competing baselines on these datasets. Furthermore, we introduce SAFE-MEME (Structured reAsoning FramEwork), a novel multimodal Chain-of-Thought-based framework employing Q&A-style reasoning (SAFE-MEME-QA) and hierarchical categorization (SAFE-MEME-H) to enable robust hate speech detection in memes. SAFE-MEME-QA outperforms existing baselines, achieving average improvements of approximately 5% and 4% on MHS and MHS-Con, respectively. In comparison, SAFE-MEME-H achieves an average improvement of 6% on MHS but outperforms only the multimodal baselines on MHS-Con. We show that fine-tuning a single-layer adapter within SAFE-MEME-H outperforms fully fine-tuned models in regular fine-grained hateful meme detection. However, full fine-tuning with a Q&A setup is more effective for handling confounding cases. We also systematically examine the error cases, offering insights into the robustness and limitations of the proposed structured reasoning framework for analyzing hateful memes.
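One way to read the hierarchical categorization of SAFE-MEME-H is as a top-down, two-level decision: first hateful versus benign, then explicit versus implicit for memes judged hateful. The sketch below assumes exactly that split and takes the two level-wise probabilities as given; how SAFE-MEME-H actually produces them (from the [GDESC] visual description and a fine-tuned model or adapter) is not modeled, and `hierarchical_predict` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HierarchicalOutput:
    label: str
    probs: Dict[str, float]  # flattened distribution over the three leaves


def hierarchical_predict(p_hateful: float, p_explicit_given_hateful: float,
                         threshold: float = 0.5) -> HierarchicalOutput:
    """Hypothetical two-level decision: hateful vs benign, then explicit vs implicit."""
    for p in (p_hateful, p_explicit_given_hateful):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Chain rule: P(explicit) = P(hateful) * P(explicit | hateful), etc.
    probs = {
        "explicit": p_hateful * p_explicit_given_hateful,
        "implicit": p_hateful * (1.0 - p_explicit_given_hateful),
        "benign": 1.0 - p_hateful,
    }
    # Level 1: stop early if the meme is judged benign.
    if p_hateful < threshold:
        return HierarchicalOutput("benign", probs)
    # Level 2: only reached for memes judged hateful.
    label = "explicit" if p_explicit_given_hateful >= threshold else "implicit"
    return HierarchicalOutput(label, probs)


print(hierarchical_predict(0.9, 0.2).label)  # → implicit
```

A greedy top-down decision can disagree with the argmax of the flattened distribution: with p_hateful = 0.6 and p_explicit_given_hateful = 0.5, the sketch returns explicit even though benign has the largest flattened probability (0.4 versus 0.3 for each hateful class).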

Paper Structure (40 sections, 10 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Demonstration of our proposed Chain-of-Thought-based structured reasoning framework for fine-grained hate speech detection in memes (SAFE-MEME) via (a) Q&A-style reasoning (SAFE-MEME-QA) and (b) hierarchical categorization (SAFE-MEME-H). Given a meme, SAFE-MEME-QA sequentially generates a series of relevant question-answer pairs, while SAFE-MEME-H opts for a two-level classification approach based on a detailed visual description ([GDESC]) before the final inference.
  • Figure 2: An illustration of annotations in the MHS dataset. Each instance is associated with a hatefulness label (explicit, implicit, or benign) along with augmented information: a general description and a Q&A component. The general description covers important attributes of the entities as well as the inter-entity relationships, while the Q&A component expresses the context as a sequence of question-answer pairs.
  • Figure 3: An illustration of instances in the MHS-Con dataset. Each unique visual content is associated with three distinct variations of textual hate content: (a) explicit hate, (b) implicit hate, and (c) benign. The same image, when paired with text (a), (b), or (c), yields an explicit, implicit, or benign meme, respectively. Also shown is a comparison of the category predictions by a closed-source large VLM (GPT-4o) and two open-source ones (MiniGPT-4 and MiniGPT-v2).
  • Figure 4: An illustration of the counts (expressed as percentages) of the various protected groups in the MHS dataset. The distribution of the protected groups is presented in four distinct setups, considering the instances of (a) explicit, (b) implicit, (c) benign, and (d) all categories together.
  • Figure 5: Architectural details of SAFE-MEME-QA. Phase 1: a textual encoder (T5-base) and a visual encoder (ViT) produce a multimodal signal via gated fusion, which serves as input to a T5-base decoder (DEC_QGen) that generates the questions Y_query. Phase 2: for each question y_i ∈ Y_query, a fused multimodal signal is fed to another T5-base decoder (DEC_RGen) to generate the responses Y_answer. The hate label L of meme M is identified via a regular expression.
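Figure 5 names two mechanisms in SAFE-MEME-QA that are easy to illustrate in isolation: gated fusion of text and image features, and label extraction with a regular expression. The sketch assumes the gate formulation used in Multimodal-CoT, lam = sigmoid(W_t·h_text + W_v·h_vis) and fused = (1 - lam)·h_text + lam·h_vis, applied to single pooled vectors instead of T5-base token states and attended ViT features. Whether SAFE-MEME-QA uses exactly this form is an assumption, and the label pattern `LABEL_RE` is hypothetical, since the caption does not give it.

```python
import math
import re
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product on plain lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(h_text: Vector, h_vis: Vector, w_t: Matrix, w_v: Matrix) -> Vector:
    """lam = sigmoid(W_t h_text + W_v h_vis); fused = (1 - lam) * h_text + lam * h_vis.

    Assumes pooled text and image vectors of equal size; a real model would
    fuse T5 token states with attended ViT patch features instead.
    """
    logits = [a + b for a, b in zip(matvec(w_t, h_text), matvec(w_v, h_vis))]
    lam = [sigmoid(z) for z in logits]
    return [(1.0 - g) * t + g * v for g, t, v in zip(lam, h_text, h_vis)]


# Hypothetical pattern: the caption only says "a regular expression" is used.
LABEL_RE = re.compile(r"\b(explicit|implicit|benign)\b", re.IGNORECASE)


def extract_label(answers: List[str]) -> Optional[str]:
    """Return the last label word found in the generated answers, if any."""
    matches = LABEL_RE.findall(" ".join(answers))
    return matches[-1].lower() if matches else None


zeros = [[0.0, 0.0], [0.0, 0.0]]
print(gated_fusion([1.0, 2.0], [3.0, 4.0], zeros, zeros))  # → [2.0, 3.0]
print(extract_label(["Is the hate explicit or implicit?", "Implicit."]))  # → implicit
```

With zero weights the gate is 0.5 everywhere, so the fused vector is the element-wise mean of the two inputs. Because the pattern uses word boundaries, adverbs such as "implicitly" do not match; a real parser would have to follow whatever answer template the decoder was trained to produce.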