Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions

Ming Shan Hee, Roy Ka-Wei Lee

TL;DR

The paper tackles hateful meme detection by addressing the dual challenge of accuracy and explainability. It introduces IntMeme, a framework that prompts Large Multimodal Models (LMMs) to generate human-like meme interpretations in a zero-shot setting, then encodes the interpretation and the meme with independent encoders before classification. Across three public datasets, IntMeme outperforms state-of-the-art pre-trained vision-language model (PT-VLM) baselines, and ablation studies confirm the value of both high-quality interpretations and separate encoders. The work also includes a human evaluation and case studies to assess explainability, and it discusses ethical considerations, limitations, and directions for safer deployment in real-world moderation systems.

Abstract

Hateful meme detection presents a significant challenge as a multimodal task due to the complexity of interpreting implicit hate messages and contextual cues within memes. Previous approaches have fine-tuned pre-trained vision-language models (PT-VLMs), leveraging the knowledge they gained during pre-training and their attention mechanisms to understand meme content. However, the reliance of these models on implicit knowledge and complex attention mechanisms renders their decisions difficult to explain, even though explainability is crucial for building trust in meme classification. In this paper, we introduce IntMeme, a novel framework that leverages Large Multimodal Models (LMMs) for hateful meme classification with explainable decisions. IntMeme addresses the dual challenges of improving both accuracy and explainability in meme moderation. The framework uses LMMs to generate human-like, interpretive analyses of memes, providing deeper insights into multimodal content and context. Additionally, it uses independent encoding modules for memes and their interpretations, whose outputs are then combined to enhance classification performance. Our approach addresses the opacity and misclassification issues associated with PT-VLMs, optimizing the use of LMMs for hateful meme detection. We demonstrate the effectiveness of IntMeme through comprehensive experiments across three datasets, showcasing its superiority over state-of-the-art models.
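
The data flow described in the abstract (two independent encoders whose outputs are combined before classification) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_encode` is a hypothetical hashed bag-of-words stand-in for the learned encoders, the interpretation string is a hard-coded placeholder for what an LMM would generate, the meme is represented only by its overlaid text, and fusion by concatenation followed by a linear layer is an assumption, since the abstract only says the two representations are "combined".

```python
import hashlib
import math
from typing import List

DIM = 8  # toy embedding size (assumption, chosen only for illustration)


def toy_encode(text: str, salt: str, dim: int = DIM) -> List[float]:
    """Hypothetical stand-in for a learned encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256((salt + token).encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero on empty text
    return [v / norm for v in vec]


def fuse(meme_vec: List[float], interp_vec: List[float]) -> List[float]:
    """Combine the two representations; concatenation is an assumed fusion choice."""
    return meme_vec + interp_vec


def classify(features: List[float], weights: List[float], bias: float) -> float:
    """Linear layer followed by a sigmoid, returning a score in (0, 1)."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder inputs: in the described framework the interpretation would come
# from prompting an LMM with the meme in a zero-shot setting.
meme_text = "example overlaid meme caption"
interpretation = "placeholder interpretation describing what the meme implies"

# Different salts emulate two independent encoders with separate parameters.
meme_vec = toy_encode(meme_text, salt="meme-encoder")
interp_vec = toy_encode(interpretation, salt="interpretation-encoder")

fused = fuse(meme_vec, interp_vec)
weights = [0.1] * len(fused)  # untrained, illustrative weights
prob = classify(fused, weights, bias=0.0)
print(f"fused dimension: {len(fused)}, score: {prob:.3f}")
```

Keeping the two encoders separate means the interpretation text never has to pass through the meme encoder, which is the property the ablations mentioned in the TL;DR are said to validate. In a real system the weights would be learned from labelled data rather than fixed.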

Paper Structure

This paper contains 40 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of the proposed IntMeme approach and its advantages in a content moderation process.
  • Figure 2: Overview of the IntMeme framework for hateful meme classification, comprising two modules: (1) Vision-Language Alignment and (2) Meme Interpretation Encoding.
  • Figure 3: LIME visualization of the meme interpretation's contribution to the IntMeme$_\text{mPLUG-Owl}$ model's prediction.