Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis
Bakhtawar Abdalla, Rebwar Mala Nabi, Hassan Eshkiki, Fabio Caraffini
TL;DR
This study tackles the scarcity of annotated NER resources for Kurdish Sorani by presenting AgaCKNER, the first NER dataset for the language, comprising 64,563 annotated tokens, along with a supporting annotation tool. It compares traditional (CRF, SVM) and neural (BiLSTM, BiLSTM-CRF) models and finds that traditional methods often outperform neural approaches in this low-resource setting. CRF achieves the highest F1 scores and greater stability, highlighting the continued relevance of feature-engineered, structured prediction in morphologically rich, data-scarce languages. The work provides a replicable dataset, tooling, and methodological guidance that can advance Kurdish NLP and inform low-resource NER research more broadly.
Abstract
This work contributes to making natural language processing techniques more inclusive and globally applicable by proposing the first named entity recognition (NER) dataset for Kurdish Sorani, a low-resource and under-represented language. The dataset consists of 64,563 annotated tokens. The work also provides an annotation tool that facilitates this task for Kurdish Sorani and many other languages, and it performs a thorough comparative analysis covering classic machine learning models and neural systems. The results challenge established assumptions about the advantage of neural approaches in NLP. Conventional methods, in particular CRF, obtain F1-scores of 0.825, significantly outperforming BiLSTM-based models (0.706). These findings indicate that simpler and more computationally efficient classical frameworks can outperform neural architectures in low-resource settings.
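To make the reported F1-scores concrete, the following is a minimal sketch of how an F1-score for NER can be computed. The abstract does not state the scoring protocol, so this assumes BIO-style tags and exact-match, entity-level scoring. The function names `extract_spans` and `entity_f1` and the entity types `PER` and `LOC` are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed tag scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray I- tag without a matching B-: treated as a span start
            # (a lenient choice made for this sketch).
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged, exact-match entity-level F1 over a list of sentences."""
    gold_spans, pred_spans = set(), set()
    for sent_id, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(sent_id, *s) for s in extract_spans(g)}
        pred_spans |= {(sent_id, *s) for s in extract_spans(p)}
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical tags: the model finds the person but misses the location.
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(round(entity_f1(gold, pred), 3))
```

In this toy case precision is 1.0 and recall is 0.5, so F1 is 2/3. A token-level protocol would give different numbers, which is why the scoring convention matters when comparing figures such as 0.825 and 0.706.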
