Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations

Prince Jha, Krishanu Maity, Raghav Jain, Apoorv Verma, Sriparna Saha, Pushpak Bhattacharyya

TL;DR

This work tackles explainability in multimodal, code-mixed cyberbullying memes by introducing MultiBully-Ex and the MExCCM task, which require both textual rationales and visual evidence. It proposes a CLIP projection-based, shared-private multitask architecture with three components: a Cross-Modal Neck, a Vision-Informed Textual Seq2Seq model, and a Linguistically-Sensitive Visual Segmentation model, augmented by a loss-prioritization scheme. Empirical results show that multimodal, multitask models outperform single-task and unimodal baselines on both textual and visual explainability, with human evaluations indicating high relevance for generated rationales. This advances interpretable meme moderation by combining robust multimodal representations with targeted explainability, and points to future work on stereotype detection and cross-language generalization.
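To make the idea of loss prioritization in a multitask setup concrete, the sketch below combines a textual-explanation loss and a visual-explanation loss using normalized priority weights. This is only an illustration under stated assumptions: the summary does not specify the paper's actual scheme, and the function name `prioritized_loss`, the task names, and the weighting rule are all hypothetical.

```python
from typing import Dict


def prioritized_loss(task_losses: Dict[str, float],
                     priorities: Dict[str, float]) -> float:
    """Combine per-task losses into one scalar using normalized priorities.

    Hypothetical illustration only: this is NOT the paper's
    loss-prioritization scheme, whose details are not given here.
    """
    if set(task_losses) != set(priorities):
        raise ValueError("task_losses and priorities must name the same tasks")
    total_priority = sum(priorities.values())
    if total_priority <= 0:
        raise ValueError("priorities must sum to a positive value")
    # Each task contributes its loss scaled by its share of the total priority.
    return sum(priorities[task] / total_priority * task_losses[task]
               for task in task_losses)


# Assumed example values: one loss per explanation task.
losses = {"textual": 2.0, "visual": 4.0}
equal = prioritized_loss(losses, {"textual": 1.0, "visual": 1.0})
text_first = prioritized_loss(losses, {"textual": 3.0, "visual": 1.0})
```

With equal priorities the combined value is the plain average of the two losses; raising the textual priority pulls the combined value toward the textual loss, which is the general effect a prioritization scheme is meant to have.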

Abstract

Internet memes have gained significant influence in communicating political, psychological, and sociocultural ideas. While memes are often humorous, there has been a rise in the use of memes for trolling and cyberbullying. Although a wide variety of effective deep learning-based models have been developed for detecting offensive multimodal memes, only a few works have addressed explainability. Recent laws, such as the "right to explanation" in the General Data Protection Regulation, have spurred research into interpretable models rather than a sole focus on performance. Motivated by this, we introduce MultiBully-Ex, the first benchmark dataset for multimodal explanation of code-mixed cyberbullying memes. Here, both the visual and textual modalities are highlighted to explain why a given meme constitutes cyberbullying. We propose a Contrastive Language-Image Pretraining (CLIP) projection-based multimodal shared-private multitask approach for visual and textual explanation of a meme. Experimental results demonstrate that training with multimodal explanations consistently improves both the generation of textual justifications and the identification of the visual evidence supporting a decision.

Paper Structure

This paper contains 30 sections, 9 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Cyberbullying explanation in memes. The aim is to highlight both the image and the text to explain why the given meme constitutes bullying.
  • Figure 2: CLIP projection-based (CP) multimodal shared-private multitask architecture. The Vision-Informed Textual Seq2Seq model is represented by a pink dotted box. The Cross-Modal Projection Neck is indicated by a blue dotted box. The Linguistically-Sensitive Visual Segmentation model is indicated by a red dotted box. Lx denotes the number of transformer layers.
  • Figure 3: Human annotation vs. the proposed model's visual and textual explanations. Green highlights indicate agreement between the human annotator and the model. Red highlights mark tokens predicted by the model but not selected by human annotators.
  • Figure 4: Distribution for Length of Meme Text
  • Figure 5: Distribution for Length of Annotated Rationales
  • ...and 1 more figure