Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations
Prince Jha, Krishanu Maity, Raghav Jain, Apoorv Verma, Sriparna Saha, Pushpak Bhattacharyya
TL;DR
This work tackles explainability in multimodal, code-mixed cyberbullying memes by introducing MultiBully-Ex and the MExCCM task, which require both textual rationales and visual evidence. It proposes a CLIP projection-based, shared-private multitask architecture with three components: a Cross-Modal Neck, a Vision-Informed Textual Seq2Seq model, and a Linguistically-Sensitive Visual Segmentation model, augmented by a loss-prioritization scheme. Empirical results show that multimodal, multitask models outperform single-task and unimodal baselines on both textual and visual explainability, with human evaluations indicating high relevance for generated rationales. This advances interpretable meme moderation by combining robust multimodal representations with targeted explainability, and points to future work on stereotype detection and cross-language generalization.
Abstract
Internet memes have gained significant influence in communicating political, psychological, and sociocultural ideas. While memes are often humorous, their use for trolling and cyberbullying has risen. Although a wide variety of effective deep learning-based models have been developed for detecting offensive multimodal memes, only a few works address explainability. Recent regulations, such as the "right to explanation" in the General Data Protection Regulation, have spurred research into interpretable models rather than a sole focus on performance. Motivated by this, we introduce MultiBully-Ex, the first benchmark dataset for multimodal explanation of code-mixed cyberbullying memes. Here, both visual and textual modalities are highlighted to explain why a given meme is cyberbullying. We propose a Contrastive Language-Image Pretraining (CLIP) projection-based multimodal shared-private multitask approach for visual and textual explanation of a meme. Experimental results demonstrate that training with multimodal explanations improves both the generation of textual justifications and the identification of the visual evidence supporting a decision.
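To make the shared-private multitask idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class `SharedPrivateEncoder`, the function `prioritized_loss`, the task names, the random dense layers, and the normalized-weight loss are all hypothetical choices for illustration, and the input vector merely stands in for a fused CLIP projection of a meme. The sketch only shows the general pattern in which each task sees one representation shared across tasks plus one private to that task, and the per-task losses are combined with priorities.

```python
import random


def make_layer(out_dim, in_dim, rng):
    """Create a random dense layer as a list of weight rows (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights, features):
    """Apply a dense layer (without bias) to a feature vector."""
    return [sum(w * v for w, v in zip(row, features)) for row in weights]


class SharedPrivateEncoder:
    """Hypothetical shared-private encoder: one shared layer, one private layer per task."""

    def __init__(self, in_dim, hidden_dim, tasks, seed=0):
        rng = random.Random(seed)
        self.shared = make_layer(hidden_dim, in_dim, rng)
        self.private = {task: make_layer(hidden_dim, in_dim, rng) for task in tasks}

    def encode(self, task, features):
        # Concatenate the task-agnostic (shared) and task-specific (private) views.
        return linear(self.shared, features) + linear(self.private[task], features)


def prioritized_loss(losses, priorities):
    """Combine per-task losses using normalized priorities (an assumed weighting scheme)."""
    total = sum(priorities.values())
    return sum(priorities[task] / total * losses[task] for task in losses)


# Assumed task names; the 4-dimensional vector stands in for a fused meme embedding.
tasks = ["textual_rationale", "visual_segmentation"]
encoder = SharedPrivateEncoder(in_dim=4, hidden_dim=3, tasks=tasks)
fused = [0.1, 0.2, 0.3, 0.4]

text_repr = encoder.encode("textual_rationale", fused)
visual_repr = encoder.encode("visual_segmentation", fused)
combined = prioritized_loss(
    {"textual_rationale": 1.0, "visual_segmentation": 3.0},
    {"textual_rationale": 1.0, "visual_segmentation": 1.0},
)
```

In this sketch the first `hidden_dim` entries of `text_repr` and `visual_repr` are identical because they come from the shared layer, while the remaining entries differ per task. In a trained model the shared part would carry information useful to both explanation tasks, and the priorities would control how strongly each task's loss drives the update.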
