A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations

Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri

Abstract

Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address it, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available for Basque. The catalog distinguishes two types of data sources: online data originally written in a dialect, and data adapted from the standard variety to a dialect. The former includes all dialectal data that can be found online, such as news and radio sites and informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, and videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. For manual adaptation, the test split of the XNLI Natural Language Inference dataset was adapted into three Basque dialects (Western, Central, and Navarrese-Lapurdian), yielding a high-quality parallel gold-standard evaluation dataset. For automatic adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) was additionally evaluated by native speakers to assess its quality and to determine whether automatic adaptation could serve as a viable substitute for full manual adaptation (i.e., silver data creation).

Paper Structure

This paper contains 29 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The Basque dialectal catalog consists of two different sources: online dialectal data and standard-to-dialect adapted data, either manually or automatically.
  • Figure 2: Example of a linguistic atlas.
  • Figure 3: Illustration of XNLIeu dialectal adaptation. XNLIvar was a compilation of different instances in three different dialects. Parallel XNLIvar provides the same instances in three Basque dialects, offering completely parallel data.
  • Figure 4: Levenshtein distance distribution for the three Parallel XNLIvar datasets.
  • Figure 5: Manual evaluation results. From left to right, results for all sentence pairs (all-pairs), as well as results for different sentence pair combinations (standard-west, standard-orthographic, and west-orthographic).
  • ...and 3 more figures
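
The Levenshtein distance distribution mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes character-level edit distance between a standard sentence and its dialectal counterpart; the paper may compute the distance at a different granularity or with normalization. The function names and the toy sentence pairs are hypothetical and are not taken from Parallel XNLIvar.

```python
from collections import Counter
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def distance_distribution(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many (standard, dialectal) pairs fall at each distance."""
    return Counter(levenshtein(std, dial) for std, dial in pairs)


# Toy pairs for illustration only (assumption: not drawn from the dataset).
toy_pairs = [
    ("egin dut", "egin dot"),
    ("egin dut", "egin dut"),
    ("kitten", "sitting"),
]
distribution = distance_distribution(toy_pairs)
```

Plotting the resulting counts as a histogram, one per dialect, would give a distribution of the kind the caption describes.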