Shifting from endangerment to rebirth in the Artificial Intelligence Age: An Ensemble Machine Learning Approach for Hawrami Text Classification

Aram Khaksar, Hossein Hassani

TL;DR

This work addresses the endangerment of the Hawrami dialect by assembling a labeled Hawrami text dataset and evaluating four machine learning classifiers on it. It presents a full NLP pipeline (data collection from ANF and Kurdipedia, native-speaker labeling, preprocessing, stratified splits with balancing, and TF-IDF feature extraction) and tests KNN, Linear SVM, LR, and DT, with Linear SVM achieving up to 96% accuracy in certain scenarios. The study reports that balancing can improve macro-level performance, provides interpretability via LIME, and offers a publicly usable Hawrami dataset to accelerate language documentation in low-resource settings. Overall, the paper shows a practical workflow for applying ML to endangered-language NLP to bolster digital presence and linguistic research.

Abstract

Hawrami, a dialect of Kurdish, is classified as an endangered language because it suffers from a scarcity of data and the gradual loss of its speakers. Natural Language Processing (NLP) projects can partially compensate for the lack of data for endangered languages and dialects through a variety of approaches, such as machine translation, language model building, and corpora development. Similarly, NLP tasks such as text classification can contribute to language documentation. Several text classification studies have been conducted for Kurdish, but they were mainly dedicated to two particular dialects: Sorani (Central Kurdish) and Kurmanji (Northern Kurdish). In this paper, we introduce various text classification models using a dataset of 6,854 articles in Hawrami labeled into 15 categories by two native speakers. We use K-Nearest Neighbor (KNN), Linear Support Vector Machine (Linear SVM), Logistic Regression (LR), and Decision Tree (DT) to evaluate how well those methods perform the classification task. The results indicate that the Linear SVM achieves 96% accuracy and outperforms the other approaches.
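
To make the steps named above concrete (a stratified split, TF-IDF features, and a KNN classifier), the following is a minimal, standard-library-only sketch. It is not the authors' code: the toy English documents, the two labels, the smoothed IDF formula, the cosine-similarity vote, and every function name are assumptions chosen purely for illustration, and the real dataset and preprocessing are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def stratified_split(labels, test_ratio=0.25):
    """Split indices so each label contributes roughly test_ratio to the test set."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, test = [], []
    for idxs in by_label.values():
        n_test = max(1, round(len(idxs) * test_ratio))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)

def fit_idf(docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d.split()))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; tokens unseen in training are dropped."""
    tf = Counter(t for t in doc.split() if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def knn_predict(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query, train_vecs[i]), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus (English placeholders, not Hawrami data).
docs = [
    "team wins match goal", "player scores goal match",
    "coach team training match", "league match team player",
    "doctor hospital patient vaccine", "patient treatment hospital clinic",
    "vaccine clinic doctor health", "health patient doctor treatment",
]
labels = ["sport"] * 4 + ["health"] * 4

train_idx, test_idx = stratified_split(labels)
idf = fit_idf([docs[i] for i in train_idx])          # fit on training docs only
train_vecs = [tfidf(docs[i], idf) for i in train_idx]
train_labels = [labels[i] for i in train_idx]

preds = [knn_predict(tfidf(docs[i], idf), train_vecs, train_labels) for i in test_idx]
accuracy = sum(p == labels[i] for p, i in zip(preds, test_idx)) / len(test_idx)
print(preds, accuracy)
```

Fitting the IDF weights on the training split only keeps test documents from leaking into the features. In practice a library such as scikit-learn would likely replace these hand-written helpers, and the same features could be passed to the other three classifiers for comparison.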

Paper Structure (23 sections, 6 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: On the left: the unified script for Kurdish proposed by Ahmadi et al. [ahmadi2019towards]. On the right: a unified script for Hawrami suggested by Marani [Marani_2020hac].
  • Figure 2: The proposed pipeline for text classification.
  • Figure 3: Total data reduction for both sources.
  • Figure 4: Distribution of data collected (percentage) across categories.
  • Figure 5: The overall distribution of the number of words for all entries.
  • ...and 9 more figures