Emotion-Aware Multimodal Fusion for Meme Emotion Detection

Shivam Sharma; Ramaneswaran S; Md. Shad Akhtar; Tanmoy Chakraborty

TL;DR

The paper tackles meme emotion detection and argues for genuine multimodal grounding of text and visuals. It introduces MOOD, a meme dataset labeled with the six basic Ekman emotions, and ALFRED, a framework that explicitly models emotion-enriched visual cues and fuses the modalities through gating and cross-attention. ALFRED outperforms strong baselines on MOOD by 4.94% F1, is competitive on Memotion, and generalizes to HarMeme and Dank Memes; attention maps are used to examine its interpretability. The authors also acknowledge open challenges, notably visual obscurity and lexical ambiguity.

Abstract

The ever-evolving social media discourse has witnessed an overwhelming use of memes to express opinions or dissent. Besides being misused for spreading malcontent, they are mined by corporations and political parties to glean the public's opinion. Therefore, memes predominantly offer affect-enriched insights towards ascertaining the societal psyche. However, the current approaches are yet to model the affective dimensions expressed in memes effectively. They rely extensively on large multimodal datasets for pre-training and do not generalize well due to constrained visual-linguistic grounding. In this paper, we introduce MOOD (Meme emOtiOns Dataset), which embodies six basic emotions. We then present ALFRED (emotion-Aware muLtimodal Fusion foR Emotion Detection), a novel multimodal neural framework that (i) explicitly models emotion-enriched visual cues, and (ii) employs an efficient cross-modal fusion via a gating mechanism. Our investigation establishes ALFRED's superiority over existing baselines by 4.94% F1. Additionally, ALFRED competes strongly with previous best approaches on the challenging Memotion task. We then discuss ALFRED's domain-agnostic generalizability by demonstrating its dominance over other baselines on two recently released datasets, HarMeme and Dank Memes. Further, we analyze ALFRED's interpretability using attention maps. Finally, we highlight the inherent challenges posed by the complex interplay of disparate modality-specific cues in meme analysis.
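
To make the idea of gate-based cross-modal fusion concrete, the following is a minimal NumPy sketch of a generic gated fusion unit. It is an illustration under stated assumptions, not ALFRED's actual formulation: the function name gated_fusion, the weight matrices W_t, W_v, W_g, and all dimensions are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(text_vec, image_vec, W_t, W_v, W_g):
    """Generic gated fusion of two modality vectors (illustrative only).

    Each modality is projected into a shared space, and a learned gate z
    in (0, 1) decides, per output dimension, how much of the text versus
    the image representation to keep.
    """
    h_t = np.tanh(text_vec @ W_t)    # projected text features
    h_v = np.tanh(image_vec @ W_v)   # projected image features
    joint = np.concatenate([text_vec, image_vec], axis=-1)
    z = sigmoid(joint @ W_g)         # gate computed from both modalities
    return z * h_t + (1.0 - z) * h_v


# Hypothetical sizes: a batch of 2 memes, 8-d text and 6-d image features.
rng = np.random.default_rng(0)
d_text, d_image, d_out = 8, 6, 4
W_t = rng.normal(size=(d_text, d_out))
W_v = rng.normal(size=(d_image, d_out))
W_g = rng.normal(size=(d_text + d_image, d_out))

text_batch = rng.normal(size=(2, d_text))
image_batch = rng.normal(size=(2, d_image))
fused = gated_fusion(text_batch, image_batch, W_t, W_v, W_g)
print(fused.shape)
```

Because the gate forms a per-dimension convex combination of two tanh outputs, every fused value stays within [-1, 1]; in a trained model the gate would learn when the text is misleading and the visual cue should dominate, or the reverse.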

Paper Structure (30 sections, 6 equations, 15 figures, 7 tables)

Introduction
Related Work
Meme emOtiOns Dataset (MOOD)
Dataset collection and de-duplication
Text Length Analysis
Annotation
Lexical Analysis of MOOD
Proposed Approach
Unimodal Feature Extraction
Emotion Feature Extraction
Gated Multimodal Fusion (GMF)
Gated Cross Attention (GCA)
Prediction and Training Objective
Baseline Models
Experiments
...and 15 more sections

Figures (15)

Figure 1: Example memes: (a, b) the text modality is misleading; (c, d) the same image conveys different emotions depending on the text.
Figure 2: Examples depicting memes for six basic Ekman emotions from our dataset, MOOD.
Figure 3: Normalized histograms of meme text length per class.
Figure 4: Examples of images added to extend the AffectNet dataset, chiefly to capture non-human subjects in meme visuals: (a) a fearful 'Spongebob', (b) a joyous cat, and (c) a sad puppy.
Figure 5: Word clouds depicting the category-wise lexicon comprising the embedded texts for memes in the MOOD dataset.
...and 10 more figures