CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models

Hongzhan Lin; Zixin Chen; Ziyang Luo; Mingfei Cheng; Jing Ma; Guang Chen

TL;DR

The paper tackles Multimodal Sarcasm Target Identification (MSTI) by introducing CofiPara, a coarse-to-fine framework that combines divergent thinking with Large Multimodal Models (LMMs) and a coarser-grained pre-training stage on Multimodal Sarcasm Detection (MSD). It first prompts LMMs for competing rationales, which guide a smaller language model during pre-training on MSD, and then fine-tunes that model to identify textual and visual sarcasm targets in MSTI using specialized cross-attention decoders and multi-task losses. Empirical results on MMSD2.0 and MSTI2.0 show that CofiPara outperforms state-of-the-art MSTI and MSD baselines, with especially large gains in visual target detection (AP50) and notable improvements in textual target identification (EM). The LMM-generated rationales also make the model's predictions more explainable and easier for humans to verify, and ablation studies confirm that both MSD pre-training and LMM reasoning contribute to these gains. Overall, CofiPara offers a general, explainable approach to multimodal sarcasm understanding that can adapt to stronger LMMs and broader sarcasm-related tasks in the future.
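The two metrics named above can be illustrated with a minimal sketch. It assumes that EM means an exact match between predicted and gold textual target tokens, and that AP50 counts a predicted box as correct when its intersection-over-union (IoU) with a gold box is at least 0.5. The function names and the `(x1, y1, x2, y2)` box layout are illustrative assumptions, not taken from the paper, and the full average-precision computation over ranked predictions is not shown.

```python
def exact_match(pred_tokens, gold_tokens):
    """Return True only if the predicted target tokens equal the gold tokens exactly."""
    return list(pred_tokens) == list(gold_tokens)


def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit_at_50(pred_box, gold_box):
    """Matching criterion behind AP50: a prediction counts if IoU >= 0.5."""
    return iou(pred_box, gold_box) >= 0.5
```

For example, two boxes of equal size that overlap on half their width have an IoU of one third, so the prediction would not count as a hit under the 0.5 threshold.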

Abstract

Social media abounds with multimodal sarcasm, and identifying sarcasm targets is particularly challenging due to the implicit incongruity not directly evident in the text and image modalities. Current methods for Multimodal Sarcasm Target Identification (MSTI) predominantly focus on superficial indicators in an end-to-end manner, overlooking the nuanced understanding of multimodal sarcasm conveyed through both the text and image. This paper proposes a versatile MSTI framework with a coarse-to-fine paradigm, by augmenting sarcasm explainability with reasoning and pre-training knowledge. Inspired by the powerful capacity of Large Multimodal Models (LMMs) for multimodal reasoning, we first engage LMMs to generate competing rationales for coarser-grained pre-training of a small language model on multimodal sarcasm detection. We then propose fine-tuning the model for finer-grained sarcasm target identification. Our framework is thus empowered to adeptly unveil the intricate targets within multimodal sarcasm and mitigate the negative impact of potential noise inherent in LMMs. Experimental results demonstrate that our model far outperforms state-of-the-art MSTI methods and exhibits marked explainability in deciphering sarcasm as well.
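The "competing rationales" step can be pictured with a small sketch. It assumes that one rationale is requested from the LMM per stance (sarcastic and not sarcastic) and that both rationales are appended to the original text before it reaches the smaller model. The prompt wording, the stance tags, and all names below (`Sample`, `build_rationale_prompts`, `build_model_input`) are hypothetical and are not the authors' implementation; the LMM call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str        # tweet text
    image_path: str  # path to the paired image (unused here; a real LMM call would read it)


# Assumption: "competing" means one rationale per opposing stance.
STANCES = ("sarcastic", "not sarcastic")


def build_rationale_prompts(sample):
    """Build one hypothetical LMM prompt per stance for a text-image pair."""
    return {
        stance: f'Given the tweet "{sample.text}" and its image, explain why it is {stance}.'
        for stance in STANCES
    }


def build_model_input(sample, rationales):
    """Concatenate the text with both rationales as input for the smaller model."""
    parts = [sample.text]
    parts += [f"[{stance}] {rationales[stance]}" for stance in STANCES]
    return " ".join(parts)
```

Feeding both rationales, rather than a single LMM verdict, leaves the final decision to the trained model, which is one way to read the abstract's point about limiting the effect of noisy LMM output.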

Paper Structure (29 sections, 10 equations, 7 figures, 18 tables, 1 algorithm)

Introduction
Related Work
Our Approach
Divergent Thinking with LMM
Coarser-Grained Pre-Training
Finer-Grained Fine-Tuning
Model Training
Experiments
Experimental Setup
Main Results
Ablation Study of Target Identification
Case Study of Explainability
Conclusion and Future Work
Datasets
Baselines
...and 14 more sections

Figures (7)

Figure 1: Examples of multimodal sarcasm on Twitter: (a) "never seen a #dlr train driver before. looks like a tough job #london"; (b) "thank god for no product placement in #ukraine #eurovision". Boxes in green and words in red denote the visual and textual targets.
Figure 2: An overview of our framework, CofiPara, for multimodal sarcasm target identification.
Figure 3: Examples of correctly identified samples.
Figure 4: An example of re-annotated samples.
Figure 5: Examples of wrongly identified samples by our proposed CofiPara framework for MSTI.
...and 2 more figures
