Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks, Datasets and Challenges
Melis Çelikkol, Lydia Körber, Wei Zhao
TL;DR
This paper addresses the gap at the intersection of diachronic and diatopic variation in dialect NLP by surveying nine tasks and datasets across five dialects from the Slavic, Romance, and Germanic language families, in both spoken and written modalities. It aggregates findings on tasks ranging from corpus construction to geolocation, distance estimation, and variant transition modeling, highlighting data characteristics, methodological approaches, and open challenges. The authors identify significant gaps in state-of-the-art methods for dialect continua, emphasize data reliability and ethical considerations, and call for inclusive resources and standard benchmarks. Overall, the work provides a structured roadmap to advance diachronic-diatopic dialect NLP and to encourage broader representation of non-standard language varieties in computational linguistics.
Abstract
Ongoing contact between language communities leads to constant changes in languages over time and gives rise to language varieties and dialects. However, communities that speak non-standard varieties are often overlooked by non-inclusive NLP technologies. Recently, there has been a surge of interest in studying diatopic and diachronic changes in dialect NLP, but there is currently no research exploring the intersection of the two. Our work aims to fill this gap by systematically reviewing papers on diachronic and diatopic variation from a unified perspective. We critically assess nine tasks and datasets across five dialects from three language families (Slavic, Romance, and Germanic) in both spoken and written modalities. The tasks covered are diverse, including corpus construction, dialect distance estimation, and dialect geolocation prediction, among others. Moreover, we outline five open challenges: changes in dialect use over time, the reliability of dialect datasets, the importance of speaker characteristics, limited coverage of dialects, and ethical considerations in data collection. We hope that our work sheds light on future research towards inclusive computational methods and datasets for language varieties and dialects.
