Where Are You From? Let Me Guess! Subdialect Recognition of Speeches in Sorani Kurdish

Sana Isam; Hossein Hassani

TL;DR

The paper tackles subdialect classification in Sorani Kurdish by creating Sorani Nas, a field-recorded speech dataset comprising 29h16m40s from 107 speakers across six subdialects. It benchmarks three neural architectures—ANN, CNN, and RNN-LSTM—using MFCC features and explores data-balancing and segment-length strategies, reporting that RNN-LSTM achieves up to 96% accuracy and CNN up to 93% under oversampled conditions. The work demonstrates that balanced datasets and medium-length segments yield the best results, while it also documents challenges in data collection, speaker identification, and ethical considerations. As a public resource, Sorani Nas provides a foundation for future dialect research and potential expansion to additional Kurdish dialects, with implications for low-resource language processing and regional NLP applications.

Abstract

Classifying Sorani Kurdish subdialects is challenging because there are no publicly available datasets and no reliable sources, such as social media or websites, from which to collect data. To address this issue, we conducted field visits to various cities and villages and connected with native speakers of different age groups, genders, academic backgrounds, and professions. We recorded their voices during conversations covering diverse topics such as lifestyle, background history, hobbies, interests, vacations, and life lessons. The target area of the research was the Kurdistan Region of Iraq. As a result, we accumulated 29 hours, 16 minutes, and 40 seconds of audio recordings from 107 interviews, constituting an unbalanced dataset encompassing six subdialects. Subsequently, we adapted three deep learning models: ANN, CNN, and RNN-LSTM. We explored various configurations, including different track durations, dataset splits, and techniques for handling the imbalanced dataset, such as oversampling and undersampling. We conducted 225 experiments and evaluated the outcomes. The RNN-LSTM outperformed the other models with an accuracy of 96%, while the CNN achieved 93% and the ANN 75%. All three models performed better on balanced datasets, particularly when oversampling was used. Future studies can extend this work to other Kurdish dialects.
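The abstract mentions oversampling as one way of balancing the dataset but does not say which variant was used. As a minimal sketch, assuming plain random oversampling (duplicating minority-class items until every class matches the largest one), the idea could look like the following. The function name `random_oversample`, the toy segment identifiers, and the labels `A`, `B`, `C` are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class items at random until all classes have equal size.

    Assumption: simple random oversampling; the paper may use another variant.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing items with replacement from the class itself.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: hypothetical segment identifiers with hypothetical subdialect labels.
segments = ["a1", "a2", "a3", "a4", "b1", "b2", "c1"]
dialects = ["A", "A", "A", "A", "B", "B", "C"]

balanced_segments, balanced_dialects = random_oversample(segments, dialects)
counts = Counter(balanced_dialects)
```

In a setup like this, oversampling is usually applied to the training split only, so that duplicated segments cannot appear in both training and test data. Whether the paper follows that order is not stated in the abstract.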

Paper Structure (28 sections, 1 equation, 28 figures, 6 tables)

Introduction
The Kurdish Language
Related work
Traditional Approaches of Speech Recognition
Deep Learning Approaches of Speech Recognition
Method
Data Collection
Speech Data Editing and Segmentation
Data preprocessing
Feature Extraction
Approaches
Artificial Neural Network
Convolutional Neural Networks
Recurrent Neural Networks-Long Short-Term Memory
Experiments, Results, and Discussion
...and 13 more sections

Figures (28)

Figure 1: Geographical distribution of Sorani dialect and its subdialects, adopted from ?).
Figure 2: Feature extraction workflow
Figure 3: ANN architecture for dialect classification with the input acoustic features of speech signal and subdialects as targets with ReLU activation function in hidden layers
Figure 4: The adapted CNN
Figure 5: The type of GPU provided by Google Colaboratory.
...and 23 more figures
