I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes

Shijia Zhou, Saif M. Mohammad, Barbara Plank, Diego Frassinelli

Abstract

Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.

Paper Structure (37 sections, 10 figures, 8 tables)

Figures (10)

  • Figure 1: A meme from MET-Meme (Xu et al., 2022) playfully adapts Caesar’s famous quote "I came, I saw, I conquered". In our work, we investigate the respective contributions of image and text to models' prediction of figurative meaning in memes.
  • Figure 2: Performance of Aya-32B, Gemma-27B, and Qwen-72B on the multi-label classification task (only in FigMemes) across modalities and meme groups. Error bars indicate standard deviation over 5 runs. Top row: Count of memes with 0, 1, or 2+ figurative types (0, 1, 2+ gold labels) that are predicted to contain from 0 to 6 figurative types. Bottom row: Count of memes containing 0, 1, or 2+ figurative types that are fully correctly predicted, where all six predicted labels match the gold labels.
  • Figure 3: Human evaluation results on model-generated explanations.
  • Figure 4: Examples of memes for which MLLMs struggle to generate high-quality explanations.
  • Figure 5: Two memes from Memotion 2 with gold label "not sarcastic".
  • ...and 5 more figures