BERnaT: Basque Encoders for Representing Natural Textual Diversity

Ekhi Azurmendi, Joseba Fernandez de Landa, Jaione Bengoetxea, Maite Heredia, Julen Etxaniz, Mikel Zubillaga, Ander Soraluze, Aitor Soroa

TL;DR

This work confronts the bias introduced by training data that omits non-standard Basque varieties. It introduces BERnaT, a family of encoder-only models trained on standard, diverse, and combined Basque corpora, and a dual-track evaluation framework that splits tasks into standard and diverse subsets to measure generalization. The results show that combining standard and diverse pretraining data yields superior performance across standard and diverse NLU tasks, especially for larger models, without sacrificing standard benchmark accuracy. The study provides new Basque resources, including a large diverse corpus and benchmark datasets, to promote inclusive, robust language technology for low-resource languages.

Abstract

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.

Paper Structure

This paper contains 28 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Summary of corpora combinations used to train BERnaT models. Standard Corpus: high-quality standard Basque from sources like News or Wikipedia. Diverse Corpora: social media and historical texts capturing informal, dialectal, and pre-standard Basque.
  • Figure 2: Visualization of the diversity distribution for the Latxa standard corpora (wikipedia, egunkaria, euscrawl-v1.1, colossal-oscar, CulturaX, hplt-v1, booktegi) and the newly added BSM and EKC non-standard corpora.
  • Figure 3: Average performance of the Diverse, Standard, and Combined models at three sizes on standard tasks, diverse tasks, and all tasks.
  • Figure 4: Average performance of Diverse, Standard and combined models, with results grouped according to training size and diversity of the task as presented in Table \ref{tab:basque-dataset-sizes} (standard/diverse). Each plot corresponds to the size of the models (medium/base/large).