MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations

Aaron Scott, Maike Züfle, Jan Niehues

TL;DR

MuSaG is the first German multimodal sarcasm dataset with text, audio, and video annotations. It is drawn from four TV shows and totals 214 statements (33 minutes). It provides both full multimodal labels and modality-specific annotations, enabling detailed unimodal and multimodal analyses. A benchmark of nine models, compared against human annotations, shows that humans rely primarily on audio cues to detect sarcasm, whereas current models perform best on text, highlighting a gap in multimodal understanding. The dataset, its MuSaG-FullAgree subset, and the accompanying evaluation benchmarks offer a resource for developing and evaluating multimodal sarcasm detection and human-model alignment in German.

Abstract

Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.

Paper Structure

This paper contains 39 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: MuSaG, our human-annotated German multimodal sarcasm detection dataset.
  • Figure 2: Instructions for human annotators in German.
  • Figure 3: Instructions for human annotators in English, for reference. During the annotation process, the German instructions were used.
  • Figure 4: Prompts for multimodal sarcasm detection, single-modality.
  • Figure 5: Prompts for multimodal sarcasm detection, different modality combinations.
  • ...and 2 more figures