Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis

Bakhtawar Abdalla, Rebwar Mala Nabi, Hassan Eshkiki, Fabio Caraffini

TL;DR

This study tackles the scarcity of annotated NER resources for Kurdish Sorani by presenting AgaCKNER, the first NER dataset for the language with 64,563 tokens, along with a supporting annotation tool. It conducts a comprehensive comparison across traditional (CRF, SVM) and neural (BiLSTM, BiLSTM-CRF) models, revealing that traditional methods often outperform neural approaches in this low-resource setting. Key findings show CRF achieving the highest F1 scores and greater stability, highlighting the continued relevance of feature-engineered, structured prediction in morphologically rich, data-scarce languages. The work provides a replicable dataset, tooling, and methodological guidance that can advance Kurdish NLP and inform low-resource NER research more broadly.

Abstract

This work contributes to the inclusivity and global applicability of natural language processing by proposing the first named entity recognition (NER) dataset for Kurdish Sorani, a low-resource and under-represented language. The dataset consists of 64,563 annotated tokens. The work also provides an annotation tool that facilitates this task for Kurdish Sorani and many other languages, and it performs a thorough comparative analysis covering classic machine learning models and neural systems. The results challenge established assumptions about the advantage of neural approaches in NLP. Conventional methods, in particular CRF, obtain an F1-score of 0.825, significantly outperforming BiLSTM-based models (0.706). These findings indicate that simpler and more computationally efficient classical frameworks can outperform neural architectures in low-resource settings.
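To make the idea of a feature-engineered model concrete, the sketch below shows the kind of hand-crafted, per-token feature dictionary that is commonly fed to a CRF tagger. It is a minimal illustration and not the feature set used in the paper: the function names `token_features` and `sentence_features`, the specific features (prefixes, suffixes, neighbouring tokens), and the placeholder sentence are all assumptions chosen for the example.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build illustrative hand-crafted features for the token at index i.

    Hypothetical feature set: affix features are a common choice for
    morphologically rich languages, but the paper's own features may differ.
    """
    tok = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "token": tok,
        "prefix2": tok[:2],     # first two characters
        "suffix2": tok[-2:],    # last two characters
        "suffix3": tok[-3:],    # last three characters
        "is_digit": tok.isdigit(),
        "length": len(tok),
    }
    # Context features: neighbouring tokens, or sentence-boundary markers.
    if i > 0:
        feats["prev_token"] = tokens[i - 1]
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_token"] = tokens[i + 1]
    else:
        feats["EOS"] = True
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Placeholder sentence (not taken from the AgaCKNER dataset).
example = ["Azad", "went", "2024"]
features = sentence_features(example)
```

A list of such dictionaries per sentence, paired with a list of gold labels, is the input shape that typical CRF toolkits expect. Features of this kind are cheap to compute and can be learned from little data, which is one plausible reason for the strong CRF results reported in a low-resource setting.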

Paper Structure

This paper contains 38 sections, 5 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Constructing the dataset: key steps.
  • Figure 2: Interface of the AGA NER Annotation Tool.
  • Figure 3: Navigation buttons and drop-down menus for annotation in the AGA NER Annotation Tool web-app.
  • Figure 4: A fragment from AgaCKNER_Dataset.txt with English translations (column order adjusted per request).