Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion
Yi Shi, Wenlong Meng, Zhenyuan Guo, Chengkun Wei, Wenzhi Chen
TL;DR
This work tackles Meme Emotion Understanding (MEU) by addressing two key challenges: fine-grained cross-modal fusion and mining implicit meme meanings. It introduces MemoDetector, which combines a four-step textual enhancement module built on Multimodal Large Language Models (MLLMs), covering Image Description ($ID$), Text Meaning ($TM$), Combined Implicit Meaning ($CIM$), and Context Analysis ($CA$), with a dual-stage fusion framework that first fuses the raw image and text and then deeply fuses enhanced visual and textual features via bidirectional cross-attention. The approach yields state-of-the-art results on MET-MEME and MOOD, improving accuracy/Macro-F1 by 4.17%/4.3% on MET-MEME and by 4.04%/3.4% on MOOD, and is validated through comprehensive ablations. These findings indicate that enriching memes with structured, multi-level textual reasoning and hierarchical fusion significantly improves MEU and offers a practical path toward robust meme understanding in multimodal AI systems.
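The four enhancement steps named above can be read as a prompt chain in which later steps reuse earlier outputs. The following is a minimal sketch under stated assumptions, not the authors' implementation: the prompt wording, the `enhance` function, the `mllm` callable, and the exact dependency between steps are all hypothetical and chosen only to illustrate progressive inference.

```python
from typing import Callable, Dict

# Hypothetical prompt templates (assumption): the actual prompts used by
# MemoDetector are not given in this summary.
STEPS = [
    ("ID", "Describe the visual content of the meme image."),
    ("TM", "Explain the literal meaning of the meme text: {text}"),
    ("CIM", "Image description: {ID}\nText meaning: {TM}\n"
            "State the combined implicit meaning of the meme."),
    ("CA", "Implicit meaning: {CIM}\n"
           "Describe the background context needed to understand the meme."),
]


def enhance(image_ref: str, text: str,
            mllm: Callable[[str, str], str]) -> Dict[str, str]:
    """Run the four steps in order, feeding earlier outputs into later prompts.

    `mllm(image_ref, prompt)` is a placeholder for any multimodal model call.
    """
    outputs: Dict[str, str] = {"text": text}
    for name, template in STEPS:
        # str.format ignores unused keyword arguments, so each template
        # picks up only the fields it references.
        prompt = template.format(**outputs)
        outputs[name] = mllm(image_ref, prompt)
    outputs.pop("text")  # return only the four enhanced texts
    return outputs
```

Any callable that accepts an image reference and a prompt can be passed as `mllm`, which keeps the chain independent of a particular model provider.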
Abstract
With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal content. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes' implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme content and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on the raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at https://github.com/singing-cat/MemoDetector.
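The dual-stage fusion described above can be illustrated with a small, dependency-free sketch. It rests on several assumptions that are not taken from the paper: shallow fusion is modeled as mean pooling plus concatenation, the deep stage uses single-head scaled dot-product attention without learned projections, and all function names are hypothetical. The actual MemoDetector layers may differ substantially.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token, one column per feature


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, keys_values: Matrix) -> Matrix:
    """Each query row attends over keys_values (used as both keys and values)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def mean_pool(m: Matrix) -> List[float]:
    return [sum(row[j] for row in m) / len(m) for j in range(len(m[0]))]


def dual_stage_fuse(raw_img: Matrix, raw_txt: Matrix,
                    enh_img: Matrix, enh_txt: Matrix) -> List[float]:
    # Stage 1 (shallow, assumed): pool and concatenate raw image and text.
    shallow = mean_pool(raw_img) + mean_pool(raw_txt)
    # Stage 2 (deep): bidirectional cross-attention on enhanced features.
    img_to_txt = cross_attend(enh_img, enh_txt)  # image queries read text
    txt_to_img = cross_attend(enh_txt, enh_img)  # text queries read image
    deep = mean_pool(img_to_txt) + mean_pool(txt_to_img)
    return shallow + deep  # vector that would feed an emotion classifier
```

With feature dimension `d`, the fused vector has length `4 * d`: two pooled raw representations followed by two pooled cross-attended representations. In practice one would likely replace `cross_attend` with a learned multi-head attention layer.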
