Building a Strong Instruction Language Model for a Less-Resourced Language

Domen Vreš; Tjaša Arčon; Timotej Petrič; Dario Vajda; Marko Robnik-Šikonja; Iztok Lebar Bajec

TL;DR

GaMS3-12B is a generative model for Slovene with 12 billion parameters. It is the best-performing open-source model for Slovene within its parameter range and outperforms 12B Gemma 3 across all three evaluation scenarios: the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena.

Abstract

Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pretraining of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.

Paper Structure (33 sections, 1 figure, 10 tables)

Introduction
Related work
Continual pretraining
Parallel alignment
Base CPT
Long CPT
OCR of Slovene PDFs
OCR stage
Post-processing stage
Filtering stage
Final data preparation
Supervised fine-tuning
GaMS-Instruct dataset
Dataset construction
Open question answering
...and 18 more sections

Figures (1)

Figure 1: Our three-stage OCR pipeline. Each PDF document is first converted to Markdown using an OCR model. In the case of Llama 4 and Nanonets, the document is OCRed page by page, and the pages are then post-processed using Gemma 3 27B it. With the marker library, these two stages are merged inside the library. In the final stage, the text is cleaned using NeMo Curator filters. The diagram was AI-generated using the Nano Banana Pro tool.
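
The control flow described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The function `run_ocr_pipeline` and the helpers `toy_ocr`, `toy_post_process`, and `min_words` are hypothetical names introduced here for illustration, and the toy stand-ins replace the OCR model, the Gemma 3 27B it post-processing step, and the NeMo Curator filters, whose actual interfaces are not shown in this section.

```python
from typing import Callable, List, Optional


def run_ocr_pipeline(
    pages: List[bytes],
    ocr_page: Callable[[bytes], str],
    post_process: Callable[[str], str],
    filters: List[Callable[[str], bool]],
) -> Optional[str]:
    """Hypothetical three-stage flow: OCR, post-processing, filtering."""
    # Stage 1: OCR each page separately into raw Markdown.
    raw_pages = [ocr_page(page) for page in pages]
    # Stage 2: post-process each page (an LLM in the caption, a stub here).
    cleaned_pages = [post_process(raw) for raw in raw_pages]
    document = "\n\n".join(cleaned_pages)
    # Stage 3: keep the document only if every filter accepts it.
    if all(keep(document) for keep in filters):
        return document
    return None


# Toy stand-ins (assumptions, not real model or library calls).
def toy_ocr(page: bytes) -> str:
    # Pretend the page bytes already contain the recognized text.
    return page.decode("utf-8")


def toy_post_process(markdown: str) -> str:
    # Collapse runs of whitespace as a placeholder for LLM cleanup.
    return " ".join(markdown.split())


def min_words(n: int) -> Callable[[str], bool]:
    # Simple length filter standing in for a text-quality filter.
    return lambda text: len(text.split()) >= n


pages = [b"# Naslov\n\nPrvi   odstavek.", b"Drugi odstavek."]
result = run_ocr_pipeline(pages, toy_ocr, toy_post_process, [min_words(3)])
```

For a tool that merges the first two stages internally, as the caption says of the marker library, the same sketch applies with an identity function passed as `post_process`.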
