Multi-Granular Multimodal Clue Fusion for Meme Understanding

Li Zheng, Hao Fei, Ting Dai, Zuquan Peng, Fei Li, Huisheng Ma, Chong Teng, Donghong Ji

TL;DR

Multimodal meme understanding (MMU) aims to predict metaphor, sentiment, intention, and offensiveness from memes. It faces two bottlenecks: the loss of fine-grained visual metaphor cues and the weak correlation between text and image. The authors propose MGMCF, which combines object-level semantic mining for fine-grained image clues, a global-local cross-modal interaction module built on bidirectional cross-modal attention, and a dual-semantic guided training strategy that aligns multimodal representations in the semantic space. On the bilingual (English and Chinese) MET-MEME dataset, the method outperforms state-of-the-art baselines, with an 8.14% gain in precision for offensiveness detection and accuracy gains of 3.53%, 3.89%, and 3.52% for metaphor recognition, sentiment analysis, and intention detection, respectively. Overall, MGMCF advances MMU through finer-grained visual reasoning and more reliable text-image integration, with potential for broader multilingual meme analysis and downstream applications.

Abstract

With the continuous emergence of social media platforms used in daily life, the multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes from various perspectives by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. Despite progress, limitations persist due to the loss of fine-grained metaphorical visual clues and the neglect of the weak correlation between text and images. To overcome these limitations, we propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU. Firstly, we design an object-level semantic mining module to extract object-level image feature clues, achieving fine-grained feature clue extraction and enhancing the model's ability to capture metaphorical details and semantics. Secondly, we propose a new global-local cross-modal interaction model to address the weak correlation between text and images. This model facilitates effective interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations through a bidirectional cross-modal attention mechanism. Finally, we devise a dual-semantic guided training strategy to enhance the model's understanding and alignment of multimodal representations in the semantic space. Experiments conducted on the widely used MET-MEME bilingual dataset demonstrate significant improvements over state-of-the-art baselines. Specifically, there is an 8.14% increase in precision for the offensiveness detection task, and respective accuracy improvements of 3.53%, 3.89%, and 3.52% for the metaphor recognition, sentiment analysis, and intention detection tasks. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MMU.
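
To make the idea of a bidirectional cross-modal attention mechanism more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, no multiple heads, and no residual connections, and the names `cross_attend`, `bidirectional_cross_attention`, `feats_a`, and `feats_b` are hypothetical. The two inputs stand in for any pair of feature sets, such as global multimodal context and local unimodal features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Let each query vector attend over keys_values.

    Assumption: keys and values are the same vectors and share the
    query dimension (no learned projection matrices).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def bidirectional_cross_attention(feats_a, feats_b):
    """Run attention in both directions: a attends to b, and b attends to a."""
    a_enhanced = cross_attend(feats_a, feats_b)
    b_enhanced = cross_attend(feats_b, feats_a)
    return a_enhanced, b_enhanced


# Toy features: three vectors on one side, two on the other (dimension 2).
feats_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats_b = [[0.5, 0.5], [1.0, -1.0]]
a_enhanced, b_enhanced = bidirectional_cross_attention(feats_a, feats_b)
```

Each side keeps its own sequence length, but every output vector is a convex combination of vectors from the other side, which is what lets one representation be strengthened by the other. A trained model would add learned query, key, and value projections on top of this.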

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Examples of Metaphorical Memes.
  • Figure 2: The overall architecture of our model. "mr" means metaphor recognition, "sa" means sentiment analysis, "id" means intention detection, "od" means offensiveness detection.
  • Figure 3: Comparative results between global-local and local-local interaction on the MET-MEME English dataset.
  • Figure 4: Visualization of a typical example.
  • Figure 5: Influence of unimodal prediction on the MET-MEME English dataset.
  • ...and 1 more figure