ThatiAR: Subjectivity Detection in Arabic News Sentences

Reem Suwaileh; Maram Hasanain; Fatema Hubail; Wajdi Zaghouani; Firoj Alam

TL;DR

ThatiAR presents the first large Arabic subjectivity-detection dataset for news sentences, comprising approximately $3.6K$ manually labeled items and accompanying GPT-4o explanations and instruction data. The paper analyzes annotator bias and benchmark results across PLMs and LLMs, showing that in-context learning with GPT-4o can achieve strong performance while Arabic PLMs provide robust Recall and $F_1$ in monolingual settings. It demonstrates a thorough data-collection and annotation pipeline, a rationales-enabled annotation process, and an instruction dataset to support future instruction-following models. The work offers a valuable resource and methodological foundation for Arabic NLP and media analysis, with planned public release to foster community-driven progress.

Abstract

Detecting subjectivity in news sentences is crucial for identifying media bias, enhancing credibility, and combating misinformation by flagging opinion-based content. It provides insights into public sentiment, empowers readers to make informed decisions, and encourages critical thinking. While research has developed methods and systems for this purpose, most efforts have focused on English and other high-resource languages. In this study, we present the first large dataset for subjectivity detection in Arabic, consisting of ~3.6K manually annotated sentences with GPT-4o-based explanations. In addition, we include instructions (in both English and Arabic) to facilitate LLM-based fine-tuning. We provide an in-depth analysis of the dataset and the annotation process, along with extensive benchmark results covering PLMs and LLMs. Our analysis of the annotation process highlights that annotators were strongly influenced by their political, cultural, and religious backgrounds, especially at the beginning of the annotation process. The experimental results suggest that LLMs with in-context learning provide better performance. We aim to release the dataset and resources for the community.

Paper Structure (30 sections, 2 equations, 2 figures, 12 tables)

Introduction
Related Work
Dataset
Data Collection
News Article Selection
Preprocessing
Sentence Selection
Data Annotation
Data Analysis
Annotation Agreement
Deep Analysis
Experimental Setup
Data
Models
Simple Models
...and 15 more sections

Figures (2)

Figure 1: An example of a subjective sentence that can be misleading and cause fear.
Figure 2: The pipeline of the data collection, annotation, and instruction/explanation datasets development process.
