Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models

Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu

TL;DR

The study tackles the subjective and context-dependent nature of multimodal sarcasm by introducing a four-task, multi-perspective evaluation framework for large vision-language models (LVLMs). It systematically prompts 12 LVLMs on $N_d=2409$ MMSD2.0 samples with multiple prompt variants to assess classification, reasoning, and confidence, revealing notable inter- and intra-model variability and a bias toward literal interpretations. The authors show that neutral samples are underrepresented in binary labels and that rationale consistency declines under interpretive tasks, underscoring the need for uncertainty-aware, multi-perspective modeling and richer annotations. A supplementary 100-sample mini-benchmark and a human evaluation corroborate the main findings and highlight the practical value of moving beyond binary sarcasm labels toward nuanced, human-aligned interpretation of multimodal sarcasm.

Abstract

With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. We further validate our findings on a diverse 100-sample mini-benchmark, incorporating multiple datasets, expanded prompt variants, and representative commercial LVLMs. Our findings reveal notable discrepancies -- across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis

Paper Structure

This paper contains 41 sections, 8 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: An example illustrating the limitations of traditional multimodal sarcasm detection models, which classify text-image pairs without rationale, compared to LVLMs, which offer explicit justifications and diverse interpretative perspectives.
  • Figure 2: Overview of the evaluation framework. Our framework takes as input an evaluation dataset consisting of $N_d$ text-image pairs with predefined labels and prompts $N_m$ LVLMs to perform four distinct tasks, each with $N_p$ prompt variants. This results in $4 \times N_d \times N_m \times N_p$ evaluation outputs, which are systematically analyzed through quantitative and qualitative assessments.
  • Figure 3: Example prompts for each task. Each task prompt first provides a task description, then gives explicit instructions for the analysis steps, and concludes by specifying the required output format. In the figure, the descriptions of analysis steps and output requirements are omitted for brevity. Detailed prompts can be found in the paper's appendix on prompt details. {{IMAGE}} and {{TEXT}} represent the input image and text, respectively.
  • Figure 4: Classification consistency heatmap. This figure shows, for each model and each of the four tasks, how consistent the classification results are across prompt variants. The gray blocks mark cases where Krippendorff’s $\alpha$ could not be computed, here int2-8B on the BSC task. Further inspection reveals that int2-8B classifies all samples as sarcastic across prompt variants; with no variation in the labels, Krippendorff’s $\alpha$ is undefined.
  • Figure 5: Rationale consistency score. This figure displays the average similarity of rationales for the same classification across all samples, tasks, and models, where different prompt variants were used.
  • ...and 11 more figures
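
The Figure 4 caption notes that Krippendorff's $\alpha$ cannot be computed when a model assigns the same label to every sample. The sketch below illustrates why, using the standard nominal-data formulation with no missing labels. It is not the authors' implementation; the function name `krippendorff_alpha_nominal` and the input layout (one list of labels per sample, one label per prompt variant) are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    units: list of lists; each inner list holds the labels that the
    different prompt variants assigned to one sample (hypothetical layout).
    Returns None when alpha is undefined (no variation in the labels).
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # sample by two different prompt variants, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a sample with fewer than two labels forms no pairs
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable labels.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement (unnormalized sums).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )

    if expected == 0:
        # Only one label ever occurs: expected disagreement is zero,
        # so the ratio observed / expected is undefined.
        return None
    return 1.0 - (n - 1) * observed / expected
```

Calling the function on a model that labels every sample "sarcastic" under every prompt variant returns `None`, which corresponds to the gray blocks described in the caption: expected disagreement is zero, so the ratio that defines $\alpha$ has a zero denominator.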