MuSaRoNews: A Multidomain, Multimodal Satire Dataset from Romanian News Articles

Răzvan-Alexandru Smădu, Andreea Iuga, Dumitru-Clementin Cercel

TL;DR

This work addresses the challenge of detecting satire in Romanian news by proposing MuSaRoNews, a large multimodal dataset that pairs text (headlines and articles) with images across seven domains. The authors establish baseline and domain-adaptive multimodal models using a BDANN-like architecture with Romanian BERT for text and VGG-19 for images, demonstrating that combining modalities yields improvements over single modalities. They also explore unsupervised domain adaptation at the topic level, showing potential generalization benefits, and perform modality ablation to quantify each signal's contribution. The dataset, its ethical considerations, and the experimental findings collectively advance satire detection research in low-resource languages and facilitate future work on multimodal misinformation cues.
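The two mechanisms named above, fusing text and image features and training a domain branch adversarially, can be illustrated with a toy sketch. It is not the authors' implementation. It assumes that the "BDANN-like" design concatenates the two feature vectors and places a gradient reversal layer in front of the domain classifier, as domain-adversarial networks commonly do. The names `fuse`, `GradientReversal`, and `lam`, and all numeric values, are hypothetical; plain lists stand in for the Romanian BERT and VGG-19 outputs.

```python
from typing import List


def fuse(text_feat: List[float], image_feat: List[float]) -> List[float]:
    """Fuse modalities by concatenation (assumed fusion strategy)."""
    return list(text_feat) + list(image_feat)


class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the backward pass.

    Placed before a domain classifier, it pushes the shared features to
    become less predictive of the domain (here, the news topic).
    """

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam  # hypothetical adversarial strength

    def forward(self, x: List[float]) -> List[float]:
        # Features pass through unchanged.
        return list(x)

    def backward(self, grad: List[float]) -> List[float]:
        # Gradients coming from the domain classifier are negated and scaled.
        return [-self.lam * g for g in grad]


# Stand-ins for a text embedding and an image embedding.
text_feat = [0.2, -0.1, 0.5]
image_feat = [1.0, 0.3]

fused = fuse(text_feat, image_feat)
grl = GradientReversal(lam=0.5)
domain_input = grl.forward(fused)
reversed_grad = grl.backward([1.0, -2.0, 0.0, 4.0, 2.0])
```

In a real model the fused vector would feed both a satire classifier and, through the reversal layer, a domain classifier. Dropping one of the two inputs to `fuse` corresponds to the single-modality ablation setting.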

Abstract

Satire and fake news can both contribute to the spread of false information, even though they have different purposes (one is for amusement, the other is to misinform). However, it is not enough to rely purely on text to detect the incongruity between the surface meaning and the actual meaning of the news articles, and, often, other sources of information (e.g., visual) provide an important clue for satire detection. This work introduces a multimodal corpus for satire detection in Romanian news articles named MuSaRoNews. Specifically, we gathered 117,834 public news articles from real and satirical news sources, composing the first multimodal corpus for satire detection in the Romanian language. We conducted experiments and showed that the use of both modalities improves performance.

Paper Structure

This paper contains 20 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The multi-modal model architecture.
  • Figure 2: t-SNE representation of the training set on the articles' content.
  • Figure 4: Regular news topic distribution.
  • Figure 6: Token distribution for mainstream news article text.
  • Figure 8: Token distribution for satirical news article texts.