InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection

Junjie Chen, Hang Yu, Subin Huang, Sanmin Liu, Linfeng Zhang

TL;DR

InterCLIP-MEP tackles multi-modal sarcasm detection by combining Interactive CLIP (InterCLIP), which injects cross-modal information into the text and vision encoders, with a Memory-Enhanced Predictor (MEP) that dynamically retains low-entropy test exemplars for non-parametric inference. An efficient training strategy uses LoRA to fine-tune only the top encoder layers, cutting trainable parameters by approximately $20.6\times$ while matching or exceeding SOTA performance across MMSD, MMSD2.0, and DocMSU. Empirical results show gains of $1.08\%$ in accuracy and $1.51\%$ in F1 on MMSD2.0, plus strong robustness under distributional shift (up to $73.96\%$ accuracy) thanks to memory-based inference. Efficiency analyses, case studies, and ablations further validate the importance of cross-modal interaction and memory in handling subtle sarcasm cues, making the framework a practical, scalable solution for real-world multi-modal sarcasm detection.
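To make the MEP idea concrete, here is a minimal sketch of a dual-channel (per-class) memory that keeps only low-entropy test exemplars and classifies non-parametrically by similarity to the stored features. The class name `DualChannelMemory`, the fixed capacity, the entropy-based eviction rule, and the cosine-similarity vote are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

class DualChannelMemory:
    """Illustrative sketch of an MEP-style dual-channel memory.

    One bank per class (non-sarcastic / sarcastic). Test samples whose
    pseudo-label entropy is low are treated as reliable and cached; the
    prediction compares the current projected feature against both banks.
    All design details here are assumptions, not the released code.
    """

    def __init__(self, feat_dim: int, capacity: int = 256):
        self.capacity = capacity
        self.banks = [torch.empty(0, feat_dim) for _ in range(2)]
        self.entropies = [torch.empty(0) for _ in range(2)]

    def update(self, feat: torch.Tensor, probs: torch.Tensor) -> None:
        """Cache a projected feature if its prediction entropy is low enough."""
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        label = int(probs.argmax())
        bank, ents = self.banks[label], self.entropies[label]
        if bank.shape[0] < self.capacity:
            self.banks[label] = torch.cat([bank, feat[None]])
            self.entropies[label] = torch.cat([ents, entropy[None]])
        else:
            worst = int(ents.argmax())  # evict the highest-entropy exemplar
            if entropy < ents[worst]:
                bank[worst] = feat
                ents[worst] = entropy

    def predict(self, feat: torch.Tensor) -> int:
        """Non-parametric decision: mean cosine similarity to each bank."""
        scores = []
        for bank in self.banks:
            if bank.shape[0] == 0:
                scores.append(torch.tensor(-1.0))
            else:
                scores.append(F.cosine_similarity(bank, feat[None]).mean())
        return int(torch.stack(scores).argmax())

# Hypothetical usage with random features and a confident pseudo-label.
mem = DualChannelMemory(feat_dim=4)
mem.update(torch.randn(4), torch.tensor([0.9, 0.1]))
print(mem.predict(torch.randn(4)))  # 0 or 1
```

In this reading, the memory needs no gradient updates at test time: it only caches projections the classifier is already confident about and lets them vote on later samples, which is what gives the predictor its robustness under shift.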

Abstract

Sarcasm in social media, frequently conveyed through the interplay of text and images, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection approaches have been shown to excessively depend on superficial cues within the textual modality, exhibiting limited capability to accurately discern sarcasm through subtle text-image interactions. To address this limitation, a novel framework, InterCLIP-MEP, is proposed. This framework integrates Interactive CLIP (InterCLIP), which employs an efficient training strategy to derive enriched cross-modal representations by embedding inter-modal information directly into each encoder, while using approximately $20.6\times$ fewer trainable parameters compared with existing state-of-the-art (SOTA) methods. Furthermore, a Memory-Enhanced Predictor (MEP) is introduced, featuring a dynamic dual-channel memory mechanism that captures and retains valuable knowledge from test samples during inference, serving as a non-parametric classifier to enhance sarcasm detection robustness. Extensive experiments on MMSD, MMSD2.0, and DocMSU show that InterCLIP-MEP achieves SOTA performance, specifically improving accuracy by 1.08% and F1 score by 1.51% on MMSD2.0. Under distributional shift evaluation, it attains 73.96% accuracy, exceeding its memory-free variant by nearly 10% and the previous SOTA by over 15%, demonstrating superior stability and adaptability. The implementation of InterCLIP-MEP is publicly available at https://github.com/CoderChen01/InterCLIP-MEP.
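The parameter-efficiency claim rests on freezing the pretrained encoders and adapting only the top layers with LoRA. The sketch below shows that general recipe in PyTorch; the rank, the scaling factor, the choice of wrapped projections, and the `layer.self_attn.q_proj`/`v_proj` attribute names are assumptions chosen for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (W + B @ A)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight stays frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def apply_lora_to_top_layers(layers: nn.ModuleList, n_top: int, rank: int = 8):
    """Freeze all encoder layers, then add LoRA adapters to the top-n only."""
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    for layer in layers[-n_top:]:
        # Hypothetical attribute names for a CLIP-style block; real encoders
        # differ (e.g., HuggingFace CLIP uses .self_attn.q_proj / .v_proj).
        attn = layer.self_attn
        attn.q_proj = LoRALinear(attn.q_proj, rank)
        attn.v_proj = LoRALinear(attn.v_proj, rank)
```

Because only the low-rank matrices receive gradients, the trainable parameter count per wrapped projection scales with rank × (in + out) rather than in × out, which is how order-of-magnitude reductions such as the reported $20.6\times$ become possible.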

Paper Structure

This paper contains 37 sections, 11 equations, 10 figures, 13 tables, and 1 algorithm.

Figures (10)

  • Figure 1: An overview of the limitations inherent in existing multi-modal sarcasm detection frameworks. Panels (a) and (b) illustrate two predominant multi-modal sarcasm detection pipelines, with their respective shortcomings highlighted by red question marks. Panel (c) provides a visual representation of multi-modal sarcasm cues, demonstrating instances where such cues are either accurately identified or misinterpreted in a multi-modal sarcasm sample.
  • Figure 2: Overview of our framework. (I) Training Interactive CLIP (InterCLIP): Vision and text representations are extracted using separate encoders and embedded into the top-$n$ layers of the opposite modality's encoder for interaction. The top-$n$ layers are fine-tuned with LoRA, while the rest of the encoder remains frozen. Final vision and text representations are concatenated and used to train a classification module for identifying multi-modal sarcasm. A projection module is also trained to project representations into a latent space. (II) Memory-Enhanced Predictor (MEP): During inference, InterCLIP generates interactive representations. The classification module assigns pseudo-labels, and the projection module provides projection features. MEP updates dynamic dual-channel memory with these features and pseudo-labels. The final prediction of the current sample is made by comparing its projected feature with those in memory.
  • Figure 3: Structure of the conditional self-attention; see the code sketch after this figure list for one plausible reading.
  • Figure 4: Hyperparameter study curves for the w/ T2V variant. Panel (d) compares results with those from using only the classification module $\mathcal{F}_{c}$ for prediction.
  • Figure 5: Case study of InterCLIP-MEP. In the figure, the emojis in the Sample column denote the ground-truth labels from the dataset. MEP represents the labels predicted by the memory-enhanced predictor, and $\mathcal{F}_{c}$ represents the labels predicted by the classification module.
  • ...and 5 more figures
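Figure 2(I) and Figure 3 describe embedding one modality's representations into the top-$n$ layers of the other encoder via conditional self-attention, but the captions alone do not fix the mechanism. The sketch below is one plausible reading, in which cross-modal features are projected and concatenated into the key/value sequence so each token attends over both modalities; `ConditionalSelfAttention`, the projection layer, and the concatenation scheme are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConditionalSelfAttention(nn.Module):
    """One plausible reading of Fig. 3: self-attention whose keys/values are
    conditioned on the other modality by concatenation. Illustrative only."""

    def __init__(self, dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cond_proj = nn.Linear(cond_dim, dim)  # map cross-modal features

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (B, L, dim)      tokens of this modality
        # cond: (B, M, cond_dim) representations from the other modality
        kv = torch.cat([x, self.cond_proj(cond)], dim=1)  # (B, L+M, dim)
        out, _ = self.attn(query=x, key=kv, value=kv, need_weights=False)
        return out

# Example: text tokens attending over themselves plus projected vision tokens.
text = torch.randn(2, 32, 512)
vision = torch.randn(2, 50, 768)
layer = ConditionalSelfAttention(dim=512, cond_dim=768)
print(layer(text, vision).shape)  # torch.Size([2, 32, 512])
```

Under this reading, placing such blocks only in the top-$n$ layers keeps most of each CLIP encoder frozen while still letting the final representations reflect text-image interaction, consistent with the training setup described in the Figure 2 caption.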