Nullpointer at ArAIEval Shared Task: Arabic Propagandist Technique Detection with Token-to-Word Mapping in Sequence Tagging
Abrar Abir, Kemal Oflazer
TL;DR
The paper tackles Arabic propaganda technique detection in multi-genre text by fine-tuning AraBERT v2 with a neural network classifier for sequence tagging. It compares token-level and word-level tagging strategies and finds that word-level prediction from the first token of each word, combined with a genre encoding (tweet vs. news), performs best. A preprocessing pipeline handles Unicode issues, misaligned spans, and user mentions so that the annotations stay clean. The final model, trained on the merged training and development data, achieves competitive performance (a score of up to 26.68) and demonstrates the value of token-to-word mapping and contextual genre information for Arabic propaganda detection.
Abstract
This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets and news paragraphs, from ArAIEval shared task 1. Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging. Experimental results show that relying on the first token of each word for technique prediction produces the best performance. In addition, incorporating genre information as a feature further enhances the model's performance. Our system achieved a score of 25.41, placing us 4th on the leaderboard. Post-submission improvements raised our score to 26.68.
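The first-token strategy described above can be illustrated with a minimal sketch. It assumes each sub-token carries the index of the word it came from, in the same shape that a Hugging Face fast tokenizer returns from `word_ids()` (with `None` for special tokens). The helper names `first_token_mask` and `word_level_labels`, and the placeholder labels `TECH_A` and `TECH_B`, are hypothetical and used only for illustration; this is not the authors' implementation.

```python
from typing import List, Optional


def first_token_mask(word_ids: List[Optional[int]]) -> List[bool]:
    """Mark positions that hold the first sub-token of a word.

    `word_ids` maps each sub-token to its word index; special tokens
    such as [CLS] and [SEP] are represented by None (assumed format).
    """
    mask = []
    previous = None
    for wid in word_ids:
        # A position is kept only if it belongs to a word and starts a new one.
        mask.append(wid is not None and wid != previous)
        previous = wid
    return mask


def word_level_labels(token_labels: List[str],
                      word_ids: List[Optional[int]]) -> List[str]:
    """Reduce token-level predictions to one label per word.

    Each word takes the label predicted for its first sub-token; the
    predictions on the remaining sub-tokens are ignored.
    """
    mask = first_token_mask(word_ids)
    return [label for label, keep in zip(token_labels, mask) if keep]


# Hypothetical example: three words split into 2, 1, and 3 sub-tokens,
# surrounded by two special tokens.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
token_labels = ["O", "TECH_A", "O", "O", "TECH_B", "TECH_B", "O", "O"]
word_labels = word_level_labels(token_labels, word_ids)
print(word_labels)
```

Because only the first sub-token of each word contributes, the number of predictions equals the number of words, which keeps the output aligned with word-level span annotations.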
