Adapter-state Sharing CLIP for Parameter-efficient Multimodal Sarcasm Detection
Soumyadeep Jana, Sahil Danayak, Sanasam Ranbir Singh
TL;DR
This paper tackles multimodal image–text sarcasm detection under resource constraints by introducing AdS-CLIP, a parameter-efficient adaptation of CLIP that places adapters only in the upper transformer layers and adds an adapter-state sharing mechanism. Textual adapter states guide the visual adapters to foster cross-modal learning, with a textual-state queue and attention used to fuse the two modalities. With adapters in layers 7–12 and a small adapter dimension, AdS-CLIP achieves state-of-the-art results on MMSD and MMSD2.0 with only 4.1M trainable parameters, outperforming both existing multimodal baselines and existing PEFT methods. Ablation and visualization analyses support the design choices: they show improved embedding separability and indicate that cross-modal guidance is critical to performance. The approach offers a practical, scalable option for efficient multimodal sarcasm detection in resource-limited settings.
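To get a feel for why restricting adapters to the upper layers keeps the trainable budget small, the following back-of-the-envelope sketch counts the parameters of a standard bottleneck adapter (one down-projection and one up-projection, each with a bias). The adapter form, the hidden widths 512 and 768, and the bottleneck size 64 are assumptions chosen for illustration, not values taken from the paper. The count also omits whatever the adapter-state sharing mechanism adds, so it is not expected to match the reported 4.1M.

```python
def bottleneck_adapter_params(width: int, bottleneck: int) -> int:
    """Parameters of one assumed bottleneck adapter: down- and up-projection with biases."""
    down = width * bottleneck + bottleneck  # W_down plus b_down
    up = bottleneck * width + width         # W_up plus b_up
    return down + up


def upper_layer_adapter_params(width: int, bottleneck: int,
                               first_layer: int = 7, last_layer: int = 12) -> int:
    """Total adapter parameters when one adapter sits in each of layers first..last."""
    n_layers = last_layer - first_layer + 1
    return n_layers * bottleneck_adapter_params(width, bottleneck)


# Hypothetical encoder widths and bottleneck size (assumptions, not from the paper).
TEXT_WIDTH, VISION_WIDTH, BOTTLENECK = 512, 768, 64

text_params = upper_layer_adapter_params(TEXT_WIDTH, BOTTLENECK)
vision_params = upper_layer_adapter_params(VISION_WIDTH, BOTTLENECK)
print(f"text adapters:   {text_params:,}")
print(f"vision adapters: {vision_params:,}")
print(f"total:           {text_params + vision_params:,}")
```

Under these assumptions the count grows linearly in both the number of adapted layers and the bottleneck size, so adapting six layers instead of twelve halves the adapter budget.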
Abstract
The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining systems. Existing approaches rely on full fine-tuning of large models, which makes them impractical to adapt in resource-constrained settings. Recent parameter-efficient fine-tuning (PEFT) methods offer promise, but they underperform on complex tasks such as sarcasm detection when applied off the shelf. We propose AdS-CLIP (Adapter-state Sharing in CLIP), a lightweight framework built on CLIP. It inserts adapters only in the upper layers, which preserves the low-level unimodal representations in the lower layers, and it introduces a novel adapter-state sharing mechanism in which textual adapters guide visual ones to promote efficient cross-modal learning in the upper layers. Experiments on two public benchmarks demonstrate that AdS-CLIP outperforms not only standard PEFT methods but also existing multimodal baselines, with significantly fewer trainable parameters.
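The idea of textual adapters guiding visual ones can be illustrated with a minimal NumPy sketch. This is one possible reading of the description above, not the authors' implementation: the bottleneck adapter form, the choice to attend from visual adapter states to queued textual adapter states, the queue length, and all names and dimensions (`text_adapter`, `visual_adapter`, `R`, `D_TEXT`, `D_VIS`) are assumptions. The frozen CLIP blocks are omitted, so only the adapter path is shown.

```python
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
ADAPTED_LAYERS = range(7, 13)  # layers 7-12, as stated in the summary
R = 8                          # hypothetical adapter (bottleneck) dimension
D_TEXT, D_VIS = 32, 48         # hypothetical toy hidden widths


def init(d_in, d_out):
    return rng.normal(scale=0.1, size=(d_in, d_out))


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def text_adapter(x, p):
    """Residual bottleneck adapter; also returns its hidden state for sharing."""
    state = np.maximum(x @ p["down"], 0.0)  # (tokens, R)
    return x + state @ p["up"], state


def visual_adapter(x, p, text_states):
    """Visual adapter whose hidden state attends over queued textual states."""
    h = np.maximum(x @ p["down"], 0.0)                # (patches, R)
    keys = np.concatenate(list(text_states), axis=0)  # (queued tokens, R)
    attn = softmax(h @ keys.T / np.sqrt(R))           # (patches, queued tokens)
    h = h + attn @ keys                               # inject textual guidance
    return x + h @ p["up"]


params = {
    layer: {
        "text": {"down": init(D_TEXT, R), "up": init(R, D_TEXT)},
        "vis": {"down": init(D_VIS, R), "up": init(R, D_VIS)},
    }
    for layer in ADAPTED_LAYERS
}


def forward(x_text, x_vis, queue_len=2):
    queue = deque(maxlen=queue_len)  # keeps only the most recent textual states
    for layer in range(1, 13):
        # A frozen CLIP transformer block would run here; omitted in this sketch.
        if layer in ADAPTED_LAYERS:
            x_text, state = text_adapter(x_text, params[layer]["text"])
            queue.append(state)
            x_vis = visual_adapter(x_vis, params[layer]["vis"], queue)
    return x_text, x_vis


x_text, x_vis = forward(rng.normal(size=(5, D_TEXT)), rng.normal(size=(10, D_VIS)))
print(x_text.shape, x_vis.shape)
```

In this sketch layers 1 to 6 are left untouched, and information flows one way in the adapted layers: the visual output depends on the text input, while the text path is unaffected by the image.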
