Towards Explainable Harmful Meme Detection through Multimodal Debate between Large Language Models

Hongzhan Lin, Ziyang Luo, Wei Gao, Jing Ma, Bo Wang, Ruichao Yang

TL;DR

This paper tackles the challenge of identifying harmful memes with transparent explanations by introducing ExplainHM, a framework that orchestrates a multimodal debate between LLMs to produce opposing rationales (harmless vs. harmful) and then trains a smaller, efficient judge to fuse these rationales with meme text and imagery for inference. The approach enables dialectical reasoning over implicit harm-indicative patterns and delivers readable explanations alongside predictions. Across three public meme datasets, ExplainHM outperforms state-of-the-art baselines in detection performance and demonstrates robust explainability through automatic and human evaluations, as well as case studies. The work highlights the practical significance of interpretable multimodal reasoning for content moderation and offers a scalable path to leveraging LLM-derived explanations without deploying massive models at inference time.

Abstract

The age of social media is flooded with Internet memes, necessitating a clear grasp and effective identification of harmful ones. This task presents a significant challenge due to the implicit meaning embedded in memes, which is not explicitly conveyed through the surface text and image. However, existing harmful meme detection methods do not present readable explanations that unveil such implicit meaning to support their detection decisions. In this paper, we propose an explainable approach to detect harmful memes, achieved through reasoning over conflicting rationales from both harmless and harmful positions. Specifically, inspired by the powerful capacity of Large Language Models (LLMs) on text generation and reasoning, we first elicit multimodal debate between LLMs to generate the explanations derived from the contradictory arguments. Then we propose to fine-tune a small language model as the debate judge for harmfulness inference, to facilitate multimodal fusion between the harmfulness rationales and the intrinsic multimodal information within memes. In this way, our model is empowered to perform dialectical reasoning over intricate and implicit harm-indicative patterns, utilizing multimodal explanations originating from both harmless and harmful arguments. Extensive experiments on three public meme datasets demonstrate that our harmful meme detection approach achieves much better performance than state-of-the-art methods and exhibits a superior capacity for explaining the meme harmfulness of the model predictions.
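The two-stage flow described above (elicit opposing rationales, then pass them with the meme to a judge) can be sketched as plain data plumbing. This is a minimal illustration, not the authors' implementation: the `Meme` class, `run_debate`, `build_judge_input`, and `stub_debater` are hypothetical names, and the stub stands in for a multimodal LLM call. The fine-tuned judge and its image encoder are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Meme:
    text: str        # text overlaid on the meme
    image_path: str  # hypothetical field: where the meme image is stored


# Assumption: a debater maps (meme, stance) to a free-text rationale.
# In a real system this would wrap a multimodal LLM.
Debater = Callable[[Meme, str], str]

STANCES = ("harmless", "harmful")


def run_debate(meme: Meme, debater: Debater) -> Dict[str, str]:
    """Elicit one rationale per stance so the judge sees both sides."""
    return {stance: debater(meme, stance) for stance in STANCES}


def build_judge_input(meme: Meme, rationales: Dict[str, str]) -> str:
    """Assemble the textual input for the judge.

    Only the text side is built here; the image would be supplied to the
    judge separately for multimodal fusion.
    """
    return (
        f"Meme text: {meme.text}\n"
        f"Harmless rationale: {rationales['harmless']}\n"
        f"Harmful rationale: {rationales['harmful']}"
    )


def stub_debater(meme: Meme, stance: str) -> str:
    """Placeholder for an LLM call; returns a canned string."""
    return f"[{stance}] argument about: {meme.text}"


meme = Meme(text="example caption", image_path="meme.png")
rationales = run_debate(meme, stub_debater)
judge_input = build_judge_input(meme, rationales)
print(judge_input)
```

Keeping the debater behind a callable means the same plumbing works whether the rationales come from a live model or from a cached file, which matters if the large models are only used offline to produce training data for the smaller judge.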

Paper Structure (34 sections, 6 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Example of trending memes on social media. Meme text: Surrender their firearms, and the confinement of people in "virus relocation centres".
  • Figure 2: The overall pipeline of our method. We first conduct the multimodal debate between LLMs, to generate the conflicting rationales from the harmless (green) and harmful (lilac) positions. Then the generated rationales are used to train a small task-specific LM judge with multimodal inputs of memes.
  • Figure 3: Examples of correctly predicted harmful memes in (a) Harm-C, (b) Harm-P, and (c) FHM datasets.
  • Figure 4: The performance of our ExplainHM and other multimodal baselines with respect to the parameter size.
  • Figure 5: Prompting LLMs from the harmful argument in the multimodal debate stage, regarding the potential harmfulness label as part of the observed attributes of the textual prompt.
  • ...and 2 more figures