
MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing

Siddhant Agarwal, Shivam Sharma, Preslav Nakov, Tanmoy Chakraborty

TL;DR

This work introduces MemeMQA, a multimodal question-answering framework for memes that jointly predicts correct answer entities and generates rationale-based explanations. It builds MemeMQACorpus, an extension of ExHVV with 1,880 meme-question pairs across 1,122 memes, and proposes ARSENAL, a two-stage architecture that leverages multimodal LLMs (e.g., LLaVA-7B, DETR embeddings) to generate generic and answer-specific rationales, then produce concise explanations. Across extensive experiments, ARSENAL outperforms unimodal and multimodal baselines by about 18% in answer accuracy and demonstrates robust explanatory quality, aided by OCR-augmented inputs and carefully crafted prompting configurations. The study also provides robustness analyses via question diversification and confounding settings, highlighting both the potential and limitations of current multimodal LLMs for nuanced meme interpretation, and outlines directions for safer, more generalizable meme QA systems.

Abstract

Memes have evolved as a prevalent medium for diverse communication, ranging from humour to propaganda. With the rising popularity of image-focused content, there is a growing need to explore its potential harm from different aspects. Previous studies have analyzed memes in closed settings - detecting harm, applying semantic labels, and offering natural language explanations. To extend this research, we introduce MemeMQA, a multimodal question-answering framework aiming to solicit accurate responses to structured questions while providing coherent explanations. We curate MemeMQACorpus, a new dataset featuring 1,880 questions related to 1,122 memes with corresponding answer-explanation pairs. We further propose ARSENAL, a novel two-stage multimodal framework that leverages the reasoning capabilities of LLMs to address MemeMQA. We benchmark MemeMQA using competitive baselines and demonstrate its superiority - ~18% enhanced answer prediction accuracy and distinct text generation lead across various metrics measuring lexical and semantic alignment over the best baseline. We analyze ARSENAL's robustness through diversification of question-set, confounder-based evaluation regarding MemeMQA's generalizability, and modality-specific assessment, enhancing our understanding of meme interpretation in the multimodal communication landscape.

Paper Structure

This paper contains 47 sections, 5 equations, 27 figures, 6 tables.

Figures (27)

  • Figure 1: The MemeMQA task: Given an input meme and multiple choices, identify the correct answer and justify.
  • Figure 2: A schematic diagram showing question-answer construction process in MemeMQACorpus, using entity and role-label information from ExHVV.
  • Figure 3: Description of the prompting setup for free-form synthetic question generation using the LLM, Llama-2-7b-chat. The randomly chosen question option is highlighted in yellow.
  • Figure 4: Comparison of various prompt configurations examined. Bar color scheme -- Green: unifiedqa-t5-base, Magenta: unifiedqa-t5-large, and Blue: t5-large.
  • Figure 5: A schematic diagram of ARSENAL for the MemeMQA task ($\bigoplus$: fusing the information via concatenation).
  • ...and 22 more figures