Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
TL;DR
The study tackles the subjective and context-dependent nature of multimodal sarcasm by introducing a four-task, multi-perspective evaluation framework for large vision-language models. It systematically prompts 12 LVLMs on $N_d=2409$ MMSD2.0 samples with multiple prompt variants to assess classification, reasoning, and confidence, revealing notable inter- and intra-model variability and a bias toward literal interpretations. The authors show that neutral samples are underrepresented in binary labels and that rationale consistency declines under interpretive tasks, underscoring the need for uncertainty-aware, multi-perspective modeling and richer annotations. A supplementary 100-sample mini-benchmark and human evaluation corroborate the main findings and highlight the practical value of moving beyond binary sarcasm labels toward nuanced, human-aligned interpretation in multimodal sarcasm understanding.
Abstract
With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. We further validate our findings on a diverse 100-sample mini-benchmark, incorporating multiple datasets, expanded prompt variants, and representative commercial LVLMs. Our findings reveal notable discrepancies, both across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis
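
To make the notions of intra-model and inter-model variability concrete, the following is a minimal sketch of how agreement could be quantified from collected model outputs. It is an illustration only, not the metric used in the paper: the function names, the label strings, and the choice of majority-share and pairwise-match rates are all assumptions made here.

```python
from collections import Counter
from itertools import combinations


def majority_agreement(labels):
    """Fraction of labels equal to the most common label in the list."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def intra_model_consistency(predictions):
    """Mean majority agreement across samples for ONE model.

    predictions: dict mapping sample_id -> list of labels,
    one label per prompt variant (hypothetical data layout).
    """
    scores = [majority_agreement(labels) for labels in predictions.values()]
    return sum(scores) / len(scores)


def inter_model_agreement(model_labels):
    """Mean pairwise match rate across models on their shared samples.

    model_labels: dict mapping model_name -> dict of sample_id -> label
    (hypothetical data layout).
    """
    rates = []
    for a, b in combinations(model_labels, 2):
        shared = model_labels[a].keys() & model_labels[b].keys()
        if not shared:
            continue  # no overlapping samples for this pair
        matches = sum(model_labels[a][s] == model_labels[b][s] for s in shared)
        rates.append(matches / len(shared))
    return sum(rates) / len(rates) if rates else float("nan")


# Toy example with made-up labels (not data from the paper).
one_model = {
    "s1": ["sarcastic", "sarcastic", "neutral"],
    "s2": ["non-sarcastic", "non-sarcastic", "non-sarcastic"],
}
two_models = {
    "model_a": {"s1": "sarcastic", "s2": "neutral"},
    "model_b": {"s1": "sarcastic", "s2": "non-sarcastic"},
}
print(intra_model_consistency(one_model))
print(inter_model_agreement(two_models))
```

A chance-corrected statistic such as Cohen's or Fleiss' kappa would be a natural alternative to these raw match rates when label distributions are skewed.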
