EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching
Maite Heredia, Jeremy Barnes, Aitor Soroa
TL;DR
EuskañolDS addresses the scarcity of Basque–Spanish code-switching (CS) data by building a naturally sourced CS corpus with a semi-supervised pipeline: language identification is used to detect CS in the BasqueParl, HelduGazte, and Covid-19 corpora, and the detected instances are then manually validated. The dataset has two splits: a silver set of 20,008 automatically classified instances and a gold set of 927 manually validated instances, which support both analysis and downstream NLP tasks. The work offers a first resource for Basque–Spanish CS, presents detailed qualitative insights into CS typology and dialectal variation, and provides a public release to support linguistic research and practical NLP work. The resource is intended to support token-level language identification, stance detection, and broader studies of language contact in a low-resource language pair, drawing on real-world texts from both social and formal settings.
Abstract
Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources for analysing this phenomenon and for supporting the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We present a first approach to developing a naturally sourced corpus for Basque-Spanish code-switching. Our methodology consists of identifying CS texts in previously available corpora using language identification models; these texts are then manually validated to obtain a reliable subset of CS instances. We present the properties of our corpus and make it available under the name EuskañolDS.
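To make the detection step described above more concrete, the following is a minimal sketch of how candidate CS sentences might be flagged before manual validation. It is not the authors' pipeline: the word lists, the `toy_token_lid` function, and the `min_tokens_per_lang` threshold are all hypothetical stand-ins for a real language identification model and its settings.

```python
from typing import Callable, List

# Hypothetical toy lexicons; a real pipeline would use a trained
# language identification model instead of word lists.
BASQUE_WORDS = {"eta", "da", "ez", "bat", "gaur", "etxean", "nago", "oso", "ondo"}
SPANISH_WORDS = {"y", "es", "no", "un", "hoy", "pero", "estoy", "muy", "cansado", "en", "casa"}

PUNCTUATION = ".,;:!?¿¡\"'()"


def toy_token_lid(token: str) -> str:
    """Label one token as 'eu' (Basque), 'es' (Spanish), or 'other'."""
    cleaned = token.lower().strip(PUNCTUATION)
    if cleaned in BASQUE_WORDS:
        return "eu"
    if cleaned in SPANISH_WORDS:
        return "es"
    return "other"


def is_cs_candidate(
    sentence: str,
    lid: Callable[[str], str] = toy_token_lid,
    min_tokens_per_lang: int = 2,  # assumed threshold, chosen for illustration
) -> bool:
    """Flag a sentence when both languages appear often enough."""
    labels = [lid(token) for token in sentence.split()]
    return (
        labels.count("eu") >= min_tokens_per_lang
        and labels.count("es") >= min_tokens_per_lang
    )


def select_candidates(sentences: List[str]) -> List[str]:
    """Keep only the sentences flagged as CS candidates (a 'silver'-style set)."""
    return [s for s in sentences if is_cs_candidate(s)]


if __name__ == "__main__":
    corpus = [
        "Gaur etxean nago, pero estoy muy cansado.",  # mixed Basque and Spanish
        "Gaur etxean nago eta oso ondo.",             # Basque only
        "Hoy estoy en casa.",                         # Spanish only
    ]
    for sentence in select_candidates(corpus):
        print(sentence)
```

In a setup like this, the automatically flagged sentences would correspond to a silver-style set, and a sample of them would be passed to human annotators to produce a validated gold-style subset.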
