Small Languages, Big Models: A Study of Continual Training on Languages of Norway

David Samuel; Vladislav Mikhailov; Erik Velldal; Lilja Øvrelid; Lucas Georges Gabriel Charpentier; Andrey Kutuzov; Stephan Oepen

TL;DR

This work tackles data scarcity for Norwegian Bokmål, Nynorsk, and Northern Sámi by introducing a three-stage continual pretraining pipeline that repurposes an English-centric model for target Nordic languages. The approach involves (1) a tokenizer replacement tailored to the target corpus, (2) embedding realignment to accommodate new subword tokens, and (3) full model training, augmented by a hybrid masked-causal objective to enable flexible inference. The resulting NorMistral-11B achieves state-of-the-art results on several Norwegian tasks, demonstrates practical efficiency gains, and is openly released along with smaller models and a Northern Sámi corpus. The work also includes thorough ablations and comparisons to baselines, showing benefits in cross-language knowledge transfer and data-constrained scaling, while acknowledging limitations in Sámi evaluation and computational costs. Overall, the paper provides a scalable blueprint for adapting large language models to low-resource languages with transparent data usage and open science commitments.
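
To make the order of the three stages concrete, here is a minimal sketch of which parameter groups could be updated at each stage. It is an illustration under stated assumptions, not the authors' implementation: the group names (`embeddings`, `transformer_blocks`, `output_head`), the `Stage` class, and the helper functions are hypothetical, and the choice to freeze everything except the embeddings during realignment is an assumption about what that stage implies.

```python
from dataclasses import dataclass

# Hypothetical parameter-group names, chosen for illustration only.
PARAM_GROUPS = ("embeddings", "transformer_blocks", "output_head")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple  # parameter groups updated during this stage


def build_schedule():
    """Return the three stages in the order given in the summary."""
    return [
        # Stage 1 swaps the tokenizer; no model weights are updated here.
        Stage("tokenizer_replacement", ()),
        # Stage 2 (assumption): only the embeddings for the new subword
        # vocabulary are trained, while the rest of the model stays frozen.
        Stage("embedding_realignment", ("embeddings",)),
        # Stage 3 trains the full model.
        Stage("full_training", PARAM_GROUPS),
    ]


def frozen_groups(stage):
    """Parameter groups that are not updated during the given stage."""
    return tuple(g for g in PARAM_GROUPS if g not in stage.trainable)


if __name__ == "__main__":
    for stage in build_schedule():
        print(stage.name, "| trains:", stage.trainable,
              "| freezes:", frozen_groups(stage))
```

In a real training loop, the frozen groups would correspond to parameters with gradients disabled; the hybrid masked-causal objective mentioned above is not modelled in this sketch.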

Abstract

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves both downstream performance and inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release NorMistral-11B, a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters.
