AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

Xiyuan Gao, Shubhi Bansal, Kushaan Gowda, Zhu Li, Shekhar Nayak, Nagendra Kumar, Matt Coler

TL;DR

This work tackles multimodal sarcasm detection under data scarcity by introducing AMuSeD, a framework that combines text-audio bimodal data augmentation with self-attention-based feature fusion. It uses back translation to augment the text and a fine-tuned FastSpeech 2-based TTS system to synthesize matching audio, and it represents audio with VGGish features. The self-attention-fused text-audio representation reaches an F1-score of 81.0% on MUStARD, outperforming several baselines and even some three-modality models. The experiments also show that both the volume of augmented data and the quality of the synthesized audio affect performance. The findings suggest practical paths toward scalable multimodal sarcasm detection and point to future work on enhancing prosody, incorporating video, and expanding cultural and linguistic coverage.

Abstract

Detecting sarcasm effectively requires a nuanced understanding of context, including vocal tones and facial expressions. The progression towards multimodal computational methods in sarcasm detection, however, faces challenges due to the scarcity of data. To address this, we present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation). This approach utilizes the Multimodal Sarcasm Detection Dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy. The first phase involves generating varied text samples through Back Translation from several secondary languages. The second phase involves the refinement of a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations. Alongside a cloud-based Text-to-Speech (TTS) service, this Fine-tuned FastSpeech 2 system produces corresponding audio for the text augmentations. We also investigate various attention mechanisms for effectively merging text and audio data, finding self-attention to be the most efficient for bimodal integration. Our experiments reveal that this combined augmentation and attention approach achieves a significant F1-score of 81.0% in text-audio modalities, surpassing even models that use three modalities from the MUStARD dataset.
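To make the bimodal fusion idea concrete, here is a minimal sketch of single-head self-attention over two modality tokens, one text vector and one audio vector. It is illustrative only and not the authors' implementation: the function name `self_attention_fuse`, the dimensions `d` and `d_k`, the random projection matrices, and the assumption that both modalities have already been projected to a common dimension are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_fuse(text_feat, audio_feat, w_q, w_k, w_v):
    """Fuse one text vector and one audio vector with scaled dot-product
    self-attention. Assumes both vectors share the same dimension d."""
    tokens = np.stack([text_feat, audio_feat])        # shape (2, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # shape (2, 2)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    attended = weights @ v                            # shape (2, d_k)
    # Concatenate the two attended tokens into one fused representation.
    return attended.reshape(-1), weights


# Toy inputs: random stand-ins for utterance-level text and audio features.
rng = np.random.default_rng(0)
d, d_k = 8, 4                                         # hypothetical sizes
text_feat = rng.normal(size=d)
audio_feat = rng.normal(size=d)
w_q, w_k, w_v = (rng.normal(size=(d, d_k)) for _ in range(3))

fused, weights = self_attention_fuse(text_feat, audio_feat, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In a trained model the projection matrices would be learned and the fused vector would feed a classifier; here they are random so that the block runs on its own.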

Paper Structure

This paper contains 37 sections, 14 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Sample sarcastic utterance from MUStARD dataset.
  • Figure 2: System architecture of the proposed AMuSeD.
  • Figure 3: Schematic overview of data augmentation.
  • Figure 4: The effect of augmented data size.
  • Figure 5: The effect of synthesizers.
  • ...and 1 more figure