GRDD: A Dataset for Greek Dialectal NLP
Stergios Chatzikyriakidis, Chatrine Qwaider, Ilias Kolokousis, Christina Koula, Dimitris Papadakis, Efthymia Sakellariou
TL;DR
The paper addresses the lack of large-scale dialect resources for Modern Greek by building a web-derived corpus, imbalanced but substantial, that covers the Cypriot, Pontic, Cretan, and Kozani/Grevena (Northern) dialects alongside Standard Modern Greek, and by evaluating dialect identification with both classical ML models and a BiLSTM. Classification accuracy is high: simple models achieve around 0.92–0.94 and the BiLSTM reaches up to 0.97 on the full dataset, which suggests that the dialects carry distinct signals. The authors acknowledge limitations from data imbalance and coarse-grained dialect labels, and they point to data cleaning and finer-grained annotation as directions for future work. Overall, the dataset provides a baseline resource for Greek dialectal NLP and a platform for further methodological refinements and cross-dialect analyses.
Abstract
In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek, and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and represents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, suggesting that the dialects in question have characteristics distinct enough to allow even simple ML models to perform well. Error analysis is performed for the top-performing algorithms, showing that in a number of cases the errors are due to insufficient dataset cleaning.
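
To make the dialect identification task concrete, the following is a minimal sketch of one possible "traditional ML" baseline: a multinomial Naive Bayes classifier over character trigrams, written in plain Python. The abstract does not say which features or classifiers were used, so the choice of character n-grams, the class name `NgramNaiveBayes`, the helper `char_ngrams`, and the labels `dialect_A` and `dialect_B` are all assumptions made for illustration. The training strings are synthetic placeholders, not text from the dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Hypothetical baseline for illustration only; not the authors' model.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + vocab_size
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder strings standing in for dialect text (assumption).
train_texts = ["zaza zuzu zazu", "zuza zazu zaza", "koko kiki koki", "kiki koko kiko"]
train_labels = ["dialect_A", "dialect_A", "dialect_B", "dialect_B"]

clf = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(clf.predict("zazu zaza"))
print(clf.predict("kiko koki"))
```

Character n-grams are a common choice for this kind of task because dialectal differences often show up in spelling and morphology rather than in whole words. On an imbalanced corpus such as the one described here, per-class metrics would be worth reporting next to overall accuracy.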
