InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
Junjie Chen, Hang Yu, Subin Huang, Sanmin Liu, Linfeng Zhang
TL;DR
InterCLIP-MEP tackles multi-modal sarcasm detection by combining Interactive CLIP (InterCLIP), which injects cross-modal information into the text and vision encoders, with a Memory-Enhanced Predictor (MEP) that dynamically retains low-entropy test exemplars for non-parametric inference. An efficient training strategy uses LoRA to fine-tune only the top encoder layers, reducing trainable parameters by approximately $20.6\times$ while matching or exceeding SOTA performance across MMSD, MMSD2.0, and DocMSU. Empirical results show gains of $1.08\%$ in accuracy and $1.51\%$ in F1 on MMSD2.0, along with strong robustness under distributional shift ($73.96\%$ accuracy) owing to memory-based inference. The work also presents efficiency analyses, case studies, and ablations that validate the importance of cross-modal interaction and memory in handling subtle sarcasm cues, offering a practical, scalable solution for real-world multi-modal sarcasm detection.
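To make the efficient training strategy concrete, the following is a minimal sketch of LoRA-based top-layer fine-tuning for a CLIP backbone, using the Hugging Face transformers and peft libraries. The rank, layer indices, and target modules are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: attach LoRA adapters only to the top encoder layers of CLIP.
# Hyperparameters below are placeholders, not the paper's actual setup.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projection matrices
    layers_to_transform=[9, 10, 11],      # restrict to the top encoder layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction
```

Restricting adapters to the top layers is what drives the reported reduction in trainable parameters relative to full fine-tuning.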
Abstract
Sarcasm in social media, frequently conveyed through the interplay of text and images, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection approaches have been shown to depend excessively on superficial cues in the textual modality, exhibiting limited capability to discern sarcasm through subtle text-image interactions. To address this limitation, a novel framework, InterCLIP-MEP, is proposed. The framework integrates Interactive CLIP (InterCLIP), which derives enriched cross-modal representations by embedding inter-modal information directly into each encoder, and employs an efficient training strategy that requires approximately 20.6$\times$ fewer trainable parameters than existing state-of-the-art (SOTA) methods. Furthermore, a Memory-Enhanced Predictor (MEP) is introduced, featuring a dynamic dual-channel memory mechanism that captures and retains valuable knowledge from test samples during inference and serves as a non-parametric classifier to enhance the robustness of sarcasm detection. Extensive experiments on MMSD, MMSD2.0, and DocMSU show that InterCLIP-MEP achieves SOTA performance, improving accuracy by 1.08% and F1 score by 1.51% on MMSD2.0. Under distributional-shift evaluation, it attains 73.96% accuracy, exceeding its memory-free variant by nearly 10% and the previous SOTA by over 15%, demonstrating superior stability and adaptability. The implementation of InterCLIP-MEP is publicly available at https://github.com/CoderChen01/InterCLIP-MEP.
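The following is a minimal sketch of the dual-channel memory idea behind MEP, assuming fused multi-modal features and a trained binary classifier head. The feature dimension, memory size, entropy-based eviction rule, and cosine-similarity scoring shown here are illustrative assumptions rather than the paper's actual implementation.

```python
# Sketch: a per-class ("dual-channel") memory that retains low-entropy
# test exemplars and classifies non-parametrically by similarity.
# FEAT_DIM, MEM_SIZE, and all details below are hypothetical placeholders.
import torch
import torch.nn.functional as F

FEAT_DIM, MEM_SIZE, NUM_CLASSES = 512, 256, 2  # assumed sizes

class MemoryEnhancedPredictor:
    def __init__(self, classifier: torch.nn.Module):
        self.classifier = classifier
        # One memory bank per class: two channels for binary sarcasm.
        self.memory = [torch.empty(0, FEAT_DIM) for _ in range(NUM_CLASSES)]
        self.entropy = [torch.empty(0) for _ in range(NUM_CLASSES)]

    @torch.no_grad()
    def predict(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, FEAT_DIM) fused text-image features of test samples."""
        probs = F.softmax(self.classifier(feats), dim=-1)
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        # Admit confidently predicted (low-entropy) samples into memory.
        for f, p, e in zip(feats, probs, ent):
            self._maybe_store(int(p.argmax()), f, e)
        # Non-parametric prediction: similarity to each class's memory.
        scores = torch.stack(
            [self._memory_score(c, feats) for c in range(NUM_CLASSES)], dim=-1
        )
        return scores.argmax(dim=-1)

    def _maybe_store(self, c: int, feat: torch.Tensor, ent: torch.Tensor):
        mem, ents = self.memory[c], self.entropy[c]
        if len(mem) < MEM_SIZE:
            self.memory[c] = torch.cat([mem, feat[None]])
            self.entropy[c] = torch.cat([ents, ent[None]])
        elif ent < ents.max():  # evict the highest-entropy entry
            i = int(ents.argmax())
            mem[i], ents[i] = feat, ent

    def _memory_score(self, c: int, feats: torch.Tensor) -> torch.Tensor:
        mem = self.memory[c]
        if len(mem) == 0:
            return torch.zeros(len(feats))
        sim = F.normalize(feats, dim=-1) @ F.normalize(mem, dim=-1).T
        return sim.mean(dim=-1)
```

Because the memory is populated at inference time from the test stream itself, this style of predictor can adapt to distributional shift without any parameter updates, which is consistent with the robustness gains reported above.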
