InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection

Junjie Chen, Hang Yu, Subin Huang, Sanmin Liu, Linfeng Zhang

TL;DR

InterCLIP-MEP tackles multi-modal sarcasm detection by combining Interactive CLIP (InterCLIP), which injects cross-modal information into the text and vision encoders, with a Memory-Enhanced Predictor (MEP) that dynamically retains low-entropy test exemplars for non-parametric inference. An efficient training strategy uses LoRA to fine-tune only the top encoder layers, cutting trainable parameters by approximately $20.6\times$ while matching or exceeding SOTA performance across MMSD, MMSD2.0, and DocMSU. Empirical results show gains of $1.08\%$ in accuracy and $1.51\%$ in F1 on MMSD2.0, plus strong robustness under distributional shift (up to $73.96\%$ accuracy) thanks to memory-based inference. Efficiency analyses, case studies, and ablations further validate the importance of cross-modal interaction and memory in handling subtle sarcasm cues, making the framework a practical, scalable solution for real-world multi-modal sarcasm detection.
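To make the MEP idea concrete, here is a minimal sketch of a dual-channel (per-class) memory that keeps only low-entropy test exemplars and classifies non-parametrically by similarity to the stored features. The class name `DualChannelMemory`, the fixed capacity, the entropy-based eviction rule, and the cosine-similarity vote are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

class DualChannelMemory:
    """Illustrative sketch of an MEP-style dual-channel memory.

    One bank per class (non-sarcastic / sarcastic). Test samples whose
    pseudo-label entropy is low are treated as reliable and cached; the
    prediction compares the current projected feature against both banks.
    All design details here are assumptions, not the released code.
    """

    def __init__(self, feat_dim: int, capacity: int = 256):
        self.capacity = capacity
        self.banks = [torch.empty(0, feat_dim) for _ in range(2)]
        self.entropies = [torch.empty(0) for _ in range(2)]

    def update(self, feat: torch.Tensor, probs: torch.Tensor) -> None:
        """Cache a projected feature if its prediction entropy is low enough."""
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        label = int(probs.argmax())
        bank, ents = self.banks[label], self.entropies[label]
        if bank.shape[0] < self.capacity:
            self.banks[label] = torch.cat([bank, feat[None]])
            self.entropies[label] = torch.cat([ents, entropy[None]])
        else:
            worst = int(ents.argmax())  # evict the highest-entropy exemplar
            if entropy < ents[worst]:
                bank[worst] = feat
                ents[worst] = entropy

    def predict(self, feat: torch.Tensor) -> int:
        """Non-parametric decision: mean cosine similarity to each bank."""
        scores = []
        for bank in self.banks:
            if bank.shape[0] == 0:
                scores.append(torch.tensor(-1.0))
            else:
                scores.append(F.cosine_similarity(bank, feat[None]).mean())
        return int(torch.stack(scores).argmax())

# Hypothetical usage with random features and a confident pseudo-label.
mem = DualChannelMemory(feat_dim=4)
mem.update(torch.randn(4), torch.tensor([0.9, 0.1]))
print(mem.predict(torch.randn(4)))  # 0 or 1
```

In this reading, the memory needs no gradient updates at test time: it only caches projections the classifier is already confident about and lets them vote on later samples, which is what gives the predictor its robustness under shift.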

Abstract

Sarcasm in social media, frequently conveyed through the interplay of text and images, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection approaches have been shown to excessively depend on superficial cues within the textual modality, exhibiting limited capability to accurately discern sarcasm through subtle text-image interactions. To address this limitation, a novel framework, InterCLIP-MEP, is proposed. This framework integrates Interactive CLIP (InterCLIP), which employs an efficient training strategy to derive enriched cross-modal representations by embedding inter-modal information directly into each encoder, while using approximately $20.6\times$ fewer trainable parameters compared with existing state-of-the-art (SOTA) methods. Furthermore, a Memory-Enhanced Predictor (MEP) is introduced, featuring a dynamic dual-channel memory mechanism that captures and retains valuable knowledge from test samples during inference, serving as a non-parametric classifier to enhance sarcasm detection robustness. Extensive experiments on MMSD, MMSD2.0, and DocMSU show that InterCLIP-MEP achieves SOTA performance, specifically improving accuracy by 1.08% and F1 score by 1.51% on MMSD2.0. Under distributional shift evaluation, it attains 73.96% accuracy, exceeding its memory-free variant by nearly 10% and the previous SOTA by over 15%, demonstrating superior stability and adaptability. The implementation of InterCLIP-MEP is publicly available at https://github.com/CoderChen01/InterCLIP-MEP.
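The parameter-efficiency claim rests on freezing the pretrained encoders and adapting only the top layers with LoRA. The sketch below shows that general recipe in PyTorch; the rank, the scaling factor, the choice of wrapped projections, and the `layer.self_attn.q_proj`/`v_proj` attribute names are assumptions chosen for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (W + B @ A)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def apply_lora_to_top_layers(layers: nn.ModuleList, n_top: int, rank: int = 8):
    """Freeze all encoder layers, then add LoRA adapters to the top-n only."""
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    for layer in layers[-n_top:]:
        # Hypothetical attribute names for a CLIP-style block; real encoders
        # differ (e.g., HuggingFace CLIP uses .self_attn.q_proj / .v_proj).
        attn = layer.self_attn
        attn.q_proj = LoRALinear(attn.q_proj, rank)
        attn.v_proj = LoRALinear(attn.v_proj, rank)
```

Because only the low-rank matrices receive gradients, the trainable parameter count per wrapped projection scales with rank × (in + out) rather than in × out, which is how order-of-magnitude reductions such as the reported $20.6\times$ become possible.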

Paper Structure

This paper contains 37 sections, 11 equations, 10 figures, 13 tables, and 1 algorithm.

Figures (10)

  • Figure 1: An overview of the limitations inherent in existing multi-modal sarcasm detection frameworks. Panels (a) and (b) illustrate two predominant multi-modal sarcasm detection pipelines, with their respective shortcomings highlighted by red question marks. Panel (c) provides a visual representation of multi-modal sarcasm cues, demonstrating instances where such cues are either accurately identified or misinterpreted in a multi-modal sarcasm sample.
  • Figure 2: Overview of our framework. (I) Training Interactive CLIP (InterCLIP): Vision and text representations are extracted using separate encoders and embedded into the top-$n$ layers of the opposite modality's encoder for interaction. The top-$n$ layers are fine-tuned with LoRA, while the rest of the encoder remains frozen. Final vision and text representations are concatenated and used to train a classification module for identifying multi-modal sarcasm. A projection module is also trained to project representations into a latent space. (II) Memory-Enhanced Predictor (MEP): During inference, InterCLIP generates interactive representations. The classification module assigns pseudo-labels, and the projection module provides projection features. MEP updates dynamic dual-channel memory with these features and pseudo-labels. The final prediction of the current sample is made by comparing its projected feature with those in memory.
  • Figure 3: Structure of the conditional self-attention; see the code sketch after this figure list for one plausible reading.
  • Figure 4: Hyperparameter study curves for the w/ T2V variant. Panel (d) compares results with those from using only the classification module $\mathcal{F}_{c}$ for prediction.
  • Figure 5: Case study of InterCLIP-MEP. In the figure, the emojis in the Sample column denote the ground-truth labels from the dataset. MEP represents the labels predicted by the memory-enhanced predictor, and $\mathcal{F}_{c}$ represents the labels predicted by the classification module.
  • ...and 5 more figures
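Figure 2(I) and Figure 3 describe embedding one modality's representations into the top-$n$ layers of the other encoder via conditional self-attention, but the captions alone do not fix the mechanism. The sketch below is one plausible reading, in which cross-modal features are projected and concatenated into the key/value sequence so each token attends over both modalities; `ConditionalSelfAttention`, the projection layer, and the concatenation scheme are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConditionalSelfAttention(nn.Module):
    """One plausible reading of Fig. 3: self-attention whose keys/values are
    conditioned on the other modality by concatenation. Illustrative only."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cond_proj = nn.Linear(cond_dim, dim)  # map cross-modal features

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (B, L, dim)      tokens of this modality
        # cond: (B, M, cond_dim) representations from the other modality
        kv = torch.cat([x, self.cond_proj(cond)], dim=1)  # (B, L+M, dim)
        out, _ = self.attn(query=x, key=kv, value=kv, need_weights=False)
        return out

# Example: text tokens attending over themselves plus projected vision tokens.
text = torch.randn(2, 32, 512)
vision = torch.randn(2, 50, 768)
layer = ConditionalSelfAttention(dim=512, cond_dim=768)
print(layer(text, vision).shape)  # torch.Size([2, 32, 512])
```

Under this reading, placing such blocks only in the top-$n$ layers keeps most of each CLIP encoder frozen while still letting the final representations reflect text-image interaction, consistent with the training setup described in the Figure 2 caption.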