MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing
Siddhant Agarwal, Shivam Sharma, Preslav Nakov, Tanmoy Chakraborty
TL;DR
This work introduces MemeMQA, a multimodal question-answering framework for memes that jointly predicts correct answer entities and generates rationale-based explanations. It builds MemeMQACorpus, an extension of ExHVV with 1,880 meme-question pairs across 1,122 memes, and proposes ARSENAL, a two-stage architecture that leverages multimodal LLMs (e.g., LLaVA-7B, DETR embeddings) to generate generic and answer-specific rationales, then produce concise explanations. Across extensive experiments, ARSENAL outperforms unimodal and multimodal baselines by about 18% in answer accuracy and demonstrates robust explanatory quality, aided by OCR-augmented inputs and carefully crafted prompting configurations. The study also provides robustness analyses via question diversification and confounding settings, highlighting both the potential and limitations of current multimodal LLMs for nuanced meme interpretation, and outlines directions for safer, more generalizable meme QA systems.
Abstract
Memes have evolved as a prevalent medium for diverse communication, ranging from humour to propaganda. With the rising popularity of image-focused content, there is a growing need to explore its potential harm from different aspects. Previous studies have analyzed memes in closed settings: detecting harm, applying semantic labels, and offering natural language explanations. To extend this research, we introduce MemeMQA, a multimodal question-answering framework aiming to solicit accurate responses to structured questions while providing coherent explanations. We curate MemeMQACorpus, a new dataset featuring 1,880 questions related to 1,122 memes with corresponding answer-explanation pairs. We further propose ARSENAL, a novel two-stage multimodal framework that leverages the reasoning capabilities of LLMs to address MemeMQA. We benchmark MemeMQA using competitive baselines and demonstrate ARSENAL's superiority over the best baseline: roughly 18% higher answer prediction accuracy and a distinct lead in text generation across various metrics measuring lexical and semantic alignment. We analyze ARSENAL's robustness through diversification of the question set, confounder-based evaluation of MemeMQA's generalizability, and modality-specific assessment, enhancing our understanding of meme interpretation in the multimodal communication landscape.
