Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation Generation
Palaash Goel, Dushyant Singh Chauhan, Md Shad Akhtar
TL;DR
Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation (TURBO) addresses the challenge of explaining sarcasm in multimodal posts by incorporating the explicit target of sarcasm and external knowledge through a knowledge-enriched graph. It combines a novel shared fusion mechanism with a Graph Convolutional Network over a ConceptNet-informed knowledge graph, and uses a BART-based generator to produce explanations. On the MORE+ dataset, TURBO outperforms prior state-of-the-art methods and remains competitive with large multimodal language models while using far fewer parameters (approximately 234 million versus billions). Human evaluation corroborates the quality of TURBO's explanations, and the study also discusses limitations and ethical considerations in generating explanations of sarcasm. Overall, TURBO demonstrates that targeted, knowledge-aware fusion can enhance multimodal sarcasm explanation with improved interpretability and efficiency, along with strong qualitative results.
Abstract
Sarcasm is a linguistic phenomenon that intends to ridicule a target (e.g., entity, event, or person) in an inherent way. Multimodal Sarcasm Explanation (MuSE) aims to reveal the intended irony in a sarcastic post through a natural language explanation. Though the target of sarcasm is important, existing systems overlook its significance in generating explanations. In this paper, we propose a Target-aUgmented shaRed fusion-Based sarcasm explanatiOn model, a.k.a. TURBO. We design a novel shared-fusion mechanism to leverage the inter-modality relationships between an image and its caption. TURBO assumes the target of the sarcasm is available and uses it to guide the multimodal shared-fusion mechanism in learning the intricacies of the intended irony for explanations. We evaluate our proposed TURBO model on the MORE+ dataset. Comparison against multiple baselines and state-of-the-art models shows that TURBO improves performance by an average margin of $+3.3\%$. Moreover, we explore LLMs in zero-shot and one-shot settings for our task and observe that LLM-generated explanations, though remarkable, often fail to capture the critical nuances of the sarcasm. Furthermore, we supplement our study with an extensive human evaluation of TURBO's generated explanations and find them to be comparatively better than those of other systems.
