Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student
Soumyadeep Jana, Sanasam Ranbir Singh
TL;DR
The paper tackles few-shot multimodal sarcasm detection by distilling knowledge from a large, sarcasm-trained teacher into parameter-efficient CLIP-based students. The resulting framework, PEKD, introduces an entropy-aware gating mechanism that scales the distillation signal by the teacher's confidence, mitigating unreliable guidance. Empirical results show that PEKD variants, especially LoRA-CLIP with KD, achieve state-of-the-art performance in 1% data regimes and often outperform large vision-language models (LVLMs) while using orders of magnitude fewer trainable parameters, with strong cross-dataset generalization. The approach is modular, scalable to other multimodal models, and demonstrates improved representation alignment and prediction confidence under data scarcity.
Abstract
Multimodal sarcasm detection is challenging, especially in low-resource settings, where scarce annotated data makes subtle image-text contradictions hard to learn. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
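To make the entropy-aware gating idea concrete, the following is a minimal sketch of a per-example distillation loss whose strength shrinks as the teacher becomes less certain. The abstract does not give the exact formula, so everything specific here is an assumption for illustration: the gate is taken to be one minus the normalized entropy of the teacher's prediction, the distillation term is a temperature-scaled KL divergence, and the function names (`confidence_gate`, `gated_kd_loss`) and the default `temperature` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_gate(teacher_probs):
    """Assumed gate (not taken from the paper): 1 - normalized entropy.
    Close to 1 for a confident teacher, 0 for a uniform (maximally unsure) one."""
    max_entropy = math.log(len(teacher_probs))
    return 1.0 - entropy(teacher_probs) / max_entropy

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def gated_kd_loss(student_logits, teacher_logits, label, temperature=2.0):
    """Cross-entropy on the gold label plus an entropy-gated distillation term.
    Returns (loss, gate) so the gate value can be inspected."""
    # Supervised term from the few-shot label.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[label] + 1e-12)

    # Distillation term on temperature-softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kd = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    # Gate computed from the teacher's unsoftened prediction (an assumption).
    gate = confidence_gate(softmax(teacher_logits))
    return ce + gate * kd, gate

# Binary case: sarcastic vs. non-sarcastic.
loss_confident, gate_confident = gated_kd_loss([1.0, -1.0], [10.0, -10.0], label=0)
loss_unsure, gate_unsure = gated_kd_loss([1.0, -1.0], [0.0, 0.0], label=0)
```

Under these assumptions, a teacher that outputs equal logits for both classes yields a gate of zero, so the student is trained on the label alone; a sharply peaked teacher yields a gate near one and the full distillation term applies. In a real PEKD-style setup the same computation would run batched in a deep learning framework, with only the PEFT parameters of the student receiving gradients.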
