Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque

Ander Corral, Ixak Sarasua, Xabier Saralegi

TL;DR

This study formalizes a pipeline for developing instruction-following LLMs in a low-resource language, Basque, by applying continual pre-training, Basque-focused instruction tuning with translated datasets, and human-preference alignment. Using a sub-10B base, Llama-3.1-8B, and a Basque-adapted variant, Llama-eus-8B, the work demonstrates substantial gains: over 12 points in Basque NLU from targeted pre-training and around 24 points in instruction-following from translation-based datasets and alignment, yielding state-of-the-art results in the Basque sub-10B regime. The study also shows that Basque-specific bases with translated instruction and preference data outperform English baselines, though a noticeable Basque–English performance gap remains. Overall, the results validate a practical, language-tailored approach to expanding LLM capabilities for Basque, with implications for other low-resource languages, albeit with acknowledged limitations and ethical considerations.

Abstract

Large language models (LLMs) are typically optimized for resource-rich languages like English, exacerbating the gap between high-resource and underrepresented languages. This work presents a detailed analysis of strategies for developing a model capable of following instructions in a low-resource language, specifically Basque, by focusing on three key stages: pre-training, instruction tuning, and alignment with human preferences. Our findings demonstrate that continual pre-training with a high-quality Basque corpus of around 600 million words improves natural language understanding (NLU) of the foundational model by over 12 points. Moreover, instruction tuning and human preference alignment using automatically translated datasets proved highly effective, resulting in a 24-point improvement in instruction-following performance. The resulting models, Llama-eus-8B and Llama-eus-8B-instruct, establish a new state-of-the-art for Basque in the sub-10B parameter category.

Paper Structure

This paper contains 24 sections, 1 figure, 12 tables.

Figures (1)

  • Figure 1: Comparison of Basque performance between our Llama-eus models and Llama-3.1 baselines. This includes the foundational models' performance on NLU tasks (see the section on continual pre-training) and the instruction-following performance of instructed models (see the sections on instruction tuning and alignment). In the instructed models, lighter colors indicate partially correct answers.