Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering

Ali Anaissi, Junaid Akram, Kunal Chaturvedi, Ali Braytee

TL;DR

The paper tackles hateful meme detection by modeling the multimodal interaction between overlaid text and images. It proposes a modular pipeline that combines OCR, neutral captioning, sub-label retrieval, and a multi-turn Visual Question Answering (VQA) system within a Retrieval-Augmented Generation (RAG) framework. Empirical results on the Facebook Hateful Memes dataset show that the proposed configuration, RAG with sub-labels and VQA, substantially improves accuracy and AUROC over unimodal baselines and prior multimodal models, though a gap to human performance remains. The work highlights the practical potential for moderation pipelines, while also noting computational constraints and opportunities for future enhancements such as knowledge graphs and optimized real-time processing.

Abstract

Memes are widely used for humor and cultural commentary, but they are increasingly exploited to spread hateful content. Due to their multimodal nature, hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. To address these challenges, we propose a multimodal hate detection framework that integrates five components: OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization of hateful content, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. Together, these components enable the framework to uncover latent signals that simpler pipelines fail to detect. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
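The stages named in the abstract can be read as a simple data flow: OCR text and a neutral caption feed sub-label retrieval, a multi-turn VQA loop collects answers, and a final classifier combines everything. The following is a minimal sketch of that flow under stated assumptions, not the authors' implementation. Every name in it (`MemeAnalysis`, `retrieve_sub_labels`, `analyze_meme`, `demo`), the keyword-overlap retrieval, the sub-label index, and the stub models are hypothetical placeholders; the page does not specify the actual models, prompts, or retrieval method.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class MemeAnalysis:
    """Intermediate and final outputs of one pass through the pipeline."""
    ocr_text: str
    caption: str
    sub_labels: List[str]
    vqa_turns: List[Tuple[str, str]] = field(default_factory=list)
    hateful: bool = False


def retrieve_sub_labels(query: str, index: Dict[str, List[str]], top_k: int = 2) -> List[str]:
    """Toy stand-in for retrieval: rank sub-labels by keyword overlap with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(keywords)), label) for label, keywords in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for score, label in scored[:top_k] if score > 0]


def analyze_meme(image, ocr: Callable, captioner: Callable, vqa: Callable,
                 classifier: Callable, index: Dict[str, List[str]],
                 questions: List[str]) -> MemeAnalysis:
    text = ocr(image)                # 1. extract the overlaid text
    caption = captioner(image)       # 2. describe the visual content neutrally
    labels = retrieve_sub_labels(f"{text} {caption}", index)  # 3. sub-label retrieval
    turns: List[Tuple[str, str]] = []
    for question in questions:       # 4. multi-turn VQA; each turn sees earlier turns
        turns.append((question, vqa(image, question, list(turns))))
    hateful = classifier(text, caption, labels, turns)        # 5. final decision
    return MemeAnalysis(text, caption, labels, turns, hateful)


def demo() -> MemeAnalysis:
    """Run the pipeline with trivial stub models (all hypothetical)."""
    index = {"religion": ["religion", "faith"], "nationality": ["country", "immigrant"]}
    return analyze_meme(
        image=None,  # no real image is needed for the stubs
        ocr=lambda img: "sample text mentioning faith",
        captioner=lambda img: "a person standing near a building",
        vqa=lambda img, q, history: "yes" if "target" in q.lower() else "no",
        classifier=lambda text, cap, labels, turns: bool(labels)
        and any(answer == "yes" for _, answer in turns),
        index=index,
        questions=["Does the text target a group?", "Is a symbol present?"],
    )
```

Passing each stage in as a callable keeps the sketch modular, mirroring the modular pipeline described above: any stub can be swapped for a real OCR engine, captioner, or VQA model without changing `analyze_meme`.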

Paper Structure

This paper contains 14 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: The overall framework, RAG (sub_label + VQA) for detecting hateful content
  • Figure 2: Outputs for a hateful example (left) and a non-hateful example (right). The pipeline includes accurate OCR detection, objective captioning, multi-turn VQA addressing targeted hate cues, and final classification via RAG.