MuSaRoNews: A Multidomain, Multimodal Satire Dataset from Romanian News Articles
Răzvan-Alexandru Smădu, Andreea Iuga, Dumitru-Clementin Cercel
TL;DR
This work addresses the challenge of detecting satire in Romanian news by proposing MuSaRoNews, a large multimodal dataset that pairs text (headlines and articles) with images across seven domains. The authors establish baseline and domain-adaptive multimodal models using a BDANN-like architecture with Romanian BERT for text and VGG-19 for images, demonstrating that combining modalities yields improvements over single modalities. They also explore unsupervised domain adaptation at the topic level, showing potential generalization benefits, and perform modality ablation to quantify each signal's contribution. The dataset, its ethical considerations, and the experimental findings collectively advance satire detection research in low-resource languages and facilitate future work on multimodal misinformation cues.
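As a rough illustration of the two ideas named above (fusing text and image features, and domain-adversarial training), the following minimal sketch uses plain Python lists in place of real Romanian BERT and VGG-19 embeddings. It assumes the BDANN-like model concatenates the two feature vectors and places a gradient reversal step in front of the domain classifier, as DANN-style models commonly do. The names `fuse` and `GradientReversal`, the scaling factor `lam`, and the toy feature values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


def fuse(text_feat: List[float], image_feat: List[float]) -> List[float]:
    """Fuse modalities by concatenation (assumed fusion strategy)."""
    return list(text_feat) + list(image_feat)


@dataclass
class GradientReversal:
    """Identity in the forward pass, negated and scaled gradient in the backward pass.

    Hypothetical helper: it only shows the arithmetic of gradient reversal,
    not an actual autograd integration.
    """
    lam: float = 1.0

    def forward(self, x: List[float]) -> List[float]:
        # Features pass through unchanged to the domain classifier.
        return list(x)

    def backward(self, grad: List[float]) -> List[float]:
        # The gradient flowing back to the feature extractors is reversed,
        # pushing them towards domain-invariant features.
        return [-self.lam * g for g in grad]


# Toy stand-ins for a text embedding and an image embedding (made-up values).
text_feat = [0.2, -0.1]
image_feat = [0.5]

fused = fuse(text_feat, image_feat)
grl = GradientReversal(lam=0.5)
domain_input = grl.forward(fused)
reversed_grad = grl.backward([1.0, -2.0, 4.0])
```

In a real model the fused vector would feed both a satire classifier and, through the reversal step, a domain (topic) classifier; the sketch only shows how the features and gradients would move under these assumptions.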
Abstract
Satire and fake news can both contribute to the spread of false information, even though they have different purposes (one is for amusement, the other is to misinform). However, it is not enough to rely purely on text to detect the incongruity between the surface meaning and the actual meaning of news articles; other sources of information (e.g., visual) often provide an important clue for satire detection. This work introduces MuSaRoNews, a multimodal corpus for satire detection in Romanian news articles. Specifically, we gathered 117,834 public news articles from real and satirical news sources, forming the first multimodal corpus for satire detection in the Romanian language. We conducted experiments and showed that using both modalities improves performance.
