SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset
Răzvan-Alexandru Smădu, Andreea Iuga, Dumitru-Clementin Cercel, Florin Pop
TL;DR
SeLeRoSa addresses the lack of sentence-level satire datasets for Romanian by compiling 13,873 labeled sentences from satirical and factual news. The authors evaluate a broad range of baselines, from fine-tuned Romanian pre-trained LMs to open-source LLMs and closed-source GPT models, in zero-shot and fine-tuned settings. They provide two data variants (anonymized, and anonymized plus pre-processed) and report moderate inter-annotator agreement, which highlights the inherent difficulty of the task. Fine-tuned models perform best, with RoGemma 2 9B reaching around 76.9% accuracy and 80.7% F1; zero-shot performance remains modest, though reasoning-based models show promise. The dataset and code are released to stimulate further research on satire detection in Romanian, a low-resource language.
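To make the two reported metrics concrete, here is a minimal sketch of how accuracy and F1 can be computed for binary sentence-level labels. It assumes that 1 marks a satirical sentence and 0 a non-satirical one, and that F1 is taken over the satirical class; the summary does not state which F1 variant the authors report, so treat that choice as an assumption. The function name `accuracy_and_f1` and the toy labels are hypothetical and are not taken from the dataset or the released code.

```python
from typing import Sequence


def accuracy_and_f1(gold: Sequence[int], pred: Sequence[int]) -> tuple[float, float]:
    """Return (accuracy, F1) for binary labels, with 1 = satirical (assumed encoding)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one example")

    # Confusion counts with respect to the positive (satirical) class.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)

    accuracy = correct / len(gold)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels, invented for illustration only (not SeLeRoSa data).
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 1, 0, 0, 1, 1]
acc, f1 = accuracy_and_f1(gold, pred)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

Because F1 ignores true negatives, it can exceed accuracy when the positive class is predicted well, as in this toy example.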
Abstract
Satire, irony, and sarcasm are techniques typically used to express humor and critique rather than to deceive; however, they can occasionally be mistaken for factual reporting, much like fake news. These techniques can also be applied at a more granular level, so that satirical content is embedded within otherwise ordinary news articles. In this paper, we introduce SeLeRoSa, the first sentence-level dataset for Romanian satire detection in news articles. The dataset comprises 13,873 manually annotated sentences spanning various domains, including social issues, IT, science, and movies. Large language models (LLMs) have recently demonstrated enhanced capabilities to tackle various tasks in zero-shot settings. We therefore evaluate multiple LLM-based baselines in both zero-shot and fine-tuning settings, alongside transformer-based baseline models. Our findings reveal the current limitations of these models on the sentence-level satire detection task, paving the way for new research directions.
