Latxa: An Open Language Model and Evaluation Suite for Basque

Julen Etxaniz, Oscar Sainz, Naiara Perez, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

TL;DR

This work treats Basque as a low-resource language and presents Latxa, an open family of Llama 2-based LLMs (7B, 13B, and 70B parameters) obtained by continued pretraining on a curated Basque corpus of 4.3M documents and 4.2B tokens (1.22B words and 4.17B tokens in the final corpus). It also introduces four Basque multiple-choice benchmarks (EusProficiency, EusReading, EusTrivia, and EusExams) to enable reproducible evaluation, and compares Latxa extensively with open and closed models on language proficiency, reading comprehension, and knowledge tasks. Latxa performs strongly among open models: the 70B model surpasses previous open models and rivals GPT-4 Turbo on some Basque-specific tasks, while lagging on certain reading comprehension and knowledge-intensive tasks. The results suggest that continued pretraining from stronger English-language models could further boost Basque capabilities. The work emphasizes open data, models, and evaluation resources to foster reproducible Basque LLM research, and lays the groundwork for extending Basque NLP to areas such as truthfulness and instruction following. Overall, Latxa shows that scaling and continued pretraining are viable ways to build high-quality Basque LLMs, and it provides a substantial, openly shareable benchmark suite for future work on low-resource language modeling.

Abstract

We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,774 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.
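
As a quick sanity check on the figures quoted above, the following minimal sketch tallies the four benchmark sizes and derives a rough tokens-per-document ratio for the corpus. The variable and function names are illustrative assumptions and not part of the released resources, and the corpus figures are the rounded values from the abstract, so the ratio is only approximate.

```python
# Question counts exactly as stated in the abstract.
BENCHMARK_SIZES = {
    "EusProficiency": 5169,
    "EusReading": 352,
    "EusTrivia": 1715,
    "EusExams": 16774,
}


def benchmark_shares(sizes):
    """Return each benchmark's fraction of the total question count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_questions = sum(BENCHMARK_SIZES.values())
shares = benchmark_shares(BENCHMARK_SIZES)

# Corpus figures as quoted (rounded in the source), so this is approximate.
CORPUS_DOCUMENTS = 4.3e6
CORPUS_TOKENS = 4.2e9
avg_tokens_per_document = CORPUS_TOKENS / CORPUS_DOCUMENTS

if __name__ == "__main__":
    print(f"total questions: {total_questions}")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
    print(f"approx. tokens per document: {avg_tokens_per_document:.0f}")
```

The four datasets differ widely in size, with EusExams contributing the majority of the questions, so any score pooled over all questions would be dominated by it; a per-benchmark breakdown keeps the smaller sets, such as EusReading, visible.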

Paper Structure (30 sections, 2 figures, 13 tables)

Figures (2)

  • Figure 1: Validation perplexity throughout training.
  • Figure 2: Basic corpus quality statistics before preprocessing.