GPT-SW3: An Autoregressive Language Model for the Nordic Languages

Ariel Ekgren, Amaru Cuba Gyllensten, Felix Stollenwerk, Joey Öhman, Tim Isbister, Evangelia Gogoulou, Fredrik Carlsson, Alice Heiman, Judit Casademont, Magnus Sahlgren

Abstract

This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, through training configuration and instruction finetuning, to evaluation and considerations for release strategies. We hope that this paper can serve as a guide and reference for other researchers who undertake the development of large generative models for smaller languages.

Paper Structure (13 sections, 4 equations, 3 figures, 10 tables)

Figures (3)

  • Figure 1: Normalized learning rate schedule. The maxima of the learning rate are given in the table of training runs.
  • Figure 2: Validation loss during training.
  • Figure 3: Scaling behaviour of GPT-SW3. The validation loss is shown as a function of the model size, while the dataset size is kept constant at 320B tokens for all models. The 20B parameter model (empty circle) is excluded from the fit (dashed curve). The gray, solid curve represents the scaling law from Hoffmann et al. (2022).