GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki
TL;DR
GRDD+ is an expanded Greek dialectal dataset covering 10 varieties and 6.37 million words, built to address the shortage of resources for Greek dialect NLP. The authors fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, and Krikri-8B) with LoRA on 26,118 dialect-rich training examples and benchmark them against frontier models. Dialect-specific fine-tuning yields consistent gains of about 1.5–2 points on a 5-point naturalness scale; frontier models such as Claude-3.7-Sonnet perform best on some dialects, while the fine-tuned models perform best on others. The work examines the effect of training-data size and of each variety's dialectal distance from Standard Modern Greek, and shows that smaller, dialect-focused LLMs can match or surpass larger, general-purpose models on targeted dialect tasks. GRDD+ is positioned as a resource for Greek dialect NLP and related sociolinguistic research, with future work planned on additional varieties and evaluation metrics.
Abstract
We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, and adds six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with a total size of 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct several fine-tuning experiments to measure the effect of high-quality dialectal data on LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
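For readers unfamiliar with LoRA, the fine-tuning method named in the TL;DR, the following is a minimal sketch of the general idea: the pretrained weight matrix W stays frozen and only a low-rank pair of matrices (B, A) is trained. It is not the authors' training code. The function names, the rank of 16, the scaling factor, and the 4096 x 4096 layer shape are illustrative assumptions, not values reported in this section.

```python
# Minimal, dependency-free sketch of the LoRA idea (illustrative only).
# All names and hyperparameters here are assumptions, not the paper's setup.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    W is the frozen pretrained weight (d_out x d_in).
    A (rank x d_in) and B (d_out x rank) are the trainable low-rank factors.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable parameters in the pair (B, A)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Number of parameters in the full weight matrix W."""
    return d_out * d_in


# Hypothetical layer shape and rank, chosen only to show the size reduction.
d, r = 4096, 16
ratio = lora_trainable_params(d, d, r) / full_params(d, d)
print(f"trainable fraction for one {d}x{d} layer at rank {r}: {ratio:.4%}")
```

When B is initialised to zeros, as is conventional for LoRA, the adapted layer produces exactly the same output as the frozen base layer, so fine-tuning starts from the pretrained behaviour and moves away from it only as B and A are updated.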
