Small Languages, Big Models: A Study of Continual Training on Languages of Norway
David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov, Stephan Oepen
TL;DR
This work tackles data scarcity for Norwegian Bokmål, Nynorsk, and Northern Sámi by introducing a three-stage continual pretraining pipeline that adapts an English-centric model to these languages. The stages are (1) replacing the tokenizer with one tailored to the target corpus, (2) realigning the embeddings to accommodate the new subword tokens, and (3) training the full model. Training uses a hybrid masked-causal objective to enable flexible inference. The resulting NorMistral-11B achieves state-of-the-art results on several Norwegian tasks, improves inference efficiency, and is openly released along with smaller models and a Northern Sámi corpus. Ablations and baseline comparisons show benefits in cross-language knowledge transfer and data-constrained scaling, and the authors acknowledge limitations in Sámi evaluation and in computational cost. Overall, the paper offers a scalable blueprint for adapting large language models to low-resource languages, with transparent data usage and a commitment to open science.
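The summary does not say how the embeddings are realigned after the tokenizer is replaced. As a minimal sketch of one widely used heuristic (an assumption here, not necessarily the authors' method), a new token can reuse its old vector when it also exists in the old vocabulary, and otherwise start from the mean of the old vectors of the pieces the old tokenizer splits it into. The names `init_new_embeddings`, `toy_old_tokenize`, and the toy vocabulary are all hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def init_new_embeddings(
    new_vocab: List[str],
    old_vocab: Dict[str, int],
    old_embeddings: List[Vector],
    old_tokenize: Callable[[str], List[str]],
) -> List[Vector]:
    """Build an embedding table for new_vocab from the old table.

    Assumed heuristic (not taken from the paper summary):
    - a token shared with the old vocabulary keeps its old vector;
    - a new token gets the mean of the old vectors of its old-tokenizer pieces;
    - if no piece is known, fall back to the mean of the whole old table.
    """
    fallback = mean_vector(old_embeddings)
    table: List[Vector] = []
    for token in new_vocab:
        if token in old_vocab:
            table.append(list(old_embeddings[old_vocab[token]]))
            continue
        pieces = [p for p in old_tokenize(token) if p in old_vocab]
        if pieces:
            table.append(mean_vector([old_embeddings[old_vocab[p]] for p in pieces]))
        else:
            table.append(list(fallback))
    return table


# Toy data: a hypothetical 3-token old vocabulary with 2-dimensional vectors.
old_vocab = {"sk": 0, "ole": 1, "n": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]


def toy_old_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for the original tokenizer: greedy longest match.
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in old_vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])  # unknown character, kept as its own piece
            i += 1
    return pieces


new_table = init_new_embeddings(
    ["skole", "n", "å"], old_vocab, old_embeddings, toy_old_tokenize
)
```

In this toy run, "skole" starts from the mean of the vectors for "sk" and "ole", "n" keeps its old vector, and "å" falls back to the table mean. In a real pipeline the table would then be trained further, as stage (2) of the summary implies.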
Abstract
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves both downstream performance and inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
