Small Languages, Big Models: A Study of Continual Training on Languages of Norway

David Samuel, Vladislav Mikhailov, Erik Velldal, Lilja Øvrelid, Lucas Georges Gabriel Charpentier, Andrey Kutuzov, Stephan Oepen

TL;DR

This work tackles data scarcity for Norwegian Bokmål, Nynorsk, and Northern Sámi by introducing a three-stage continual pretraining pipeline that repurposes an English-centric model for the target languages. The approach involves (1) replacing the tokenizer with one tailored to the target corpus, (2) realigning the embeddings to accommodate the new subword tokens, and (3) training the full model, augmented by a hybrid masked-causal objective that enables flexible inference. The resulting NorMistral-11B achieves state-of-the-art results on several Norwegian tasks, demonstrates practical efficiency gains, and is openly released along with smaller models and a Northern Sámi corpus. The work also includes ablations and comparisons to baselines, showing benefits in cross-language knowledge transfer and data-constrained scaling, while acknowledging limitations in Sámi evaluation and computational costs. Overall, the paper provides a scalable blueprint for adapting large language models to low-resource languages with transparent data usage and open science commitments.

Abstract

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

Paper Structure

This paper contains 103 sections, 3 figures, and 18 tables.

Figures (3)

  • Figure 1: Language composition of the training corpus. The left panel shows the proportions of languages in the final corpus mixture, with the target languages of Norway in blue, related languages in red, and other data sources in gray. The right panel shows the upsampling factors used to obtain these proportions.
  • Figure 2: Three-stage continual pretraining. We propose a novel continual pretraining pipeline consisting of creating a new tokenizer optimized for the training corpus, realigning the embedding weights to the new tokens, and training the full language model. Arrows symbolize changes between stages, while double lines represent no changes.
  • Figure 3: Inference modes of NorMistral-11B. The hybrid masked-causal pretraining makes the model more flexible during inference. It can serve not only as a unidirectional causal language model (left), but also as a fully bidirectional masked language model (middle) or as a partially bidirectional prefix language model (right). The diagrams illustrate possible attention connections.
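
The three inference modes in the Figure 3 caption differ only in which positions may attend to which. The following is a minimal sketch of those attention patterns, not the authors' implementation: the function name `attention_mask` and the `prefix_len` parameter are hypothetical, chosen here for illustration, and the paper's actual masking code may differ.

```python
def attention_mask(seq_len, prefix_len=0):
    """Build a boolean attention mask (illustrative helper, not from the paper).

    mask[i][j] is True when query position i may attend to key position j.
      prefix_len == 0        -> causal: each token sees itself and earlier tokens
      prefix_len == seq_len  -> fully bidirectional: every token sees every token
      0 < prefix_len < seq_len -> prefix LM: the first prefix_len tokens see each
                                  other in both directions, later tokens are causal
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Render a mask as rows of 'x' (may attend) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    for name, prefix in [("causal", 0), ("bidirectional", 4), ("prefix", 2)]:
        print(name)
        print(render(attention_mask(4, prefix)))
        print()
```

A single parameter covers all three cases because causal and fully bidirectional attention are the two extremes of a prefix mask, with a prefix of length zero and a prefix spanning the whole sequence respectively.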