M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought

Gitanjali Kumari, Kirtan Jain, Asif Ekbal

TL;DR

This work tackles misogynous meme detection by introducing M3Hop-CoT, a Multimodal Multi-hop Chain-of-Thought framework that fuses meme text, image features, and entity-object relationships derived from scene graphs. An LLM is prompted to generate emotion, target, and context rationales in a three-hop sequence, and these rationales are integrated via hierarchical cross-attention, so the model captures nuanced cues often missed by unimodal or non-CoT approaches. Empirical results on the SemEval-2022 MAMI and MIMIC datasets show state-of-the-art macro-F1 performance, with strong generalization to Hateful Memes, Memotion2, and Harmful Memes; ablation analyses confirm the contribution of each component. The approach demonstrates the practical value of culturally aware, rationale-guided multimodal reasoning for safer online content moderation and highlights the importance of scene semantics and psycholinguistic factors in meme interpretation.
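The three-hop prompting sequence described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the question wording, the `HOPS` list, the `build_prompt` and `three_hop_rationales` helpers, and the choice to feed earlier rationales into later prompts are all assumptions made for the example, and `llm` stands in for any text-in, text-out model call.

```python
from typing import Callable, Dict, List

# Hypothetical hop questions; the paper's actual prompt wording may differ.
HOPS = [
    ("emotion", "What emotions does this meme convey?"),
    ("target", "Who or what is the target of this meme?"),
    ("context", "What contextual knowledge is needed to interpret this meme?"),
]


def build_prompt(meme_text: str, relations: List[str],
                 question: str, prior: Dict[str, str]) -> str:
    """Assemble one hop's prompt from the meme text, the scene-graph
    relations, and any rationales produced by earlier hops."""
    lines = [
        f"Meme text: {meme_text}",
        "Scene relations: " + "; ".join(relations),
    ]
    # Assumption: each later hop sees the rationales of the earlier hops.
    for name, rationale in prior.items():
        lines.append(f"{name.capitalize()} rationale: {rationale}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def three_hop_rationales(meme_text: str, relations: List[str],
                         llm: Callable[[str], str]) -> Dict[str, str]:
    """Query the model once per hop, in order, and collect the rationales."""
    rationales: Dict[str, str] = {}
    for name, question in HOPS:
        prompt = build_prompt(meme_text, relations, question, rationales)
        rationales[name] = llm(prompt)
    return rationales
```

In a full pipeline, the three returned rationales would then be encoded and fused with the text and image features; that fusion step (hierarchical cross-attention in the paper) is not shown here.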

Abstract

In recent years, there has been a significant rise in the phenomenon of hate against women on social media platforms, particularly through the use of misogynous memes. These memes often target women with subtle and obscure cues, making their detection a challenging task for automated systems. Recently, Large Language Models (LLMs) have shown promising results in reasoning using Chain-of-Thought (CoT) prompting to generate the intermediate reasoning chains as the rationale to facilitate multimodal tasks, but often neglect cultural diversity and key aspects like emotion and contextual knowledge hidden in the visual modalities. To address this gap, we introduce a Multimodal Multi-hop CoT (M3Hop-CoT) framework for Misogynous meme identification, combining a CLIP-based classifier and a multimodal CoT module with entity-object-relationship integration. M3Hop-CoT employs a three-step multimodal prompting principle to induce emotions, target awareness, and contextual knowledge for meme analysis. Our empirical evaluation, including both qualitative and quantitative analysis, validates the efficacy of the M3Hop-CoT framework on the SemEval-2022 Task 5 (MAMI task) dataset, highlighting its strong performance in the macro-F1 score. Furthermore, we evaluate the model's generalizability by evaluating it on various benchmark meme datasets, offering a thorough insight into the effectiveness of our approach across different datasets.
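For readers unfamiliar with the macro-F1 score reported above, the following is a small sketch of how it is commonly computed: F1 is calculated per class and then averaged with equal weight, so the minority class counts as much as the majority class. The `macro_f1` function and the toy labels are illustrative assumptions, not taken from the paper's evaluation code.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 1 = misogynous, 0 = not misogynous (made-up labels).
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
print(round(macro_f1(y_true, y_pred), 4))  # class 1: 0.8, class 0: 0.6667
```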

Paper Structure

This paper contains 41 sections, 13 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Comparison between (a) fine-tuning visual language model approach and (b) Chain-of-Thought based approach.
  • Figure 2: Illustration of the proposed M3Hop-CoT model.
  • Figure 3: Misclassification rate comparison between the proposed model M3Hop-CoT and its variants.
  • Figure 4: Categorization of error analysis (%) of proposed model M3Hop-CoT and other SOTA models
  • Figure 5: Case studies comparing the attention maps for the baseline CLIP_MM and the proposed model M3Hop-CoT using Grad-CAM, LIME (Ribeiro et al., 2016), and Integrated Gradients (Sundararajan et al., 2017) on the MAMI dataset test samples. Here, T and V are the normalized textual and visual contribution scores in the final prediction using Integrated Gradients.
  • ...and 17 more figures