Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

Gaëtan Caillaut; Raheel Qader; Mariam Nakhlé; Jingshu Liu; Jean-Gabriel Barthélemy

TL;DR

Through a series of experiments, this work shows that the loss of decoder-only models can be estimated with a scaling law similar to the one discovered for large language models, but that this law generalizes poorly to much larger models or to a different data distribution.

Abstract

Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law struggles to generalize to much larger models or to a different data distribution. We also study different scaling methods and show that scaling the depth and scaling the width of a model lead to similar test loss improvements, but with a different impact on the model's efficiency.
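To make the idea of estimating loss from model size concrete, here is a minimal sketch of fitting a power law to (parameter count, test loss) pairs. The functional form L(N) = a * N^(-alpha), the helper names `fit_power_law` and `predict_loss`, and all numeric values are assumptions made for illustration; the paper's exact law is not given on this page, and the losses below are synthetic, not reported results. Only the model sizes (70M, 160M, 410M, 1B, 7B) are taken from the text.

```python
import numpy as np


def fit_power_law(n_params, losses):
    """Fit L(N) = a * N**(-alpha) by least squares in log-log space.

    Assumed form for illustration; it omits any irreducible-loss term.
    """
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(n_params, a, alpha):
    """Evaluate the fitted law at one or more model sizes."""
    return a * np.asarray(n_params, dtype=float) ** (-alpha)


# Model sizes mentioned on this page; the losses are SYNTHETIC values
# generated from a known law, not measurements from the paper.
sizes = np.array([70e6, 160e6, 410e6, 1e9])
true_a, true_alpha = 20.0, 0.1
synthetic_losses = true_a * sizes ** (-true_alpha)

# Fit on the smaller models, then extrapolate to a 7B-parameter model.
a, alpha = fit_power_law(sizes, synthetic_losses)
extrapolated = predict_loss(7e9, a, alpha)
```

Because the synthetic data follow the assumed law exactly, the extrapolation to 7B is exact here. With real measurements, comparing such an extrapolation against the observed loss of the largest model is one way to probe the generalization limits the abstract describes.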

Paper Structure (18 sections, 5 equations, 11 figures, 5 tables)

This paper contains 18 sections, 5 equations, 11 figures, 5 tables.

Introduction
Background
Training methodology
Data
Tokenizer
Data format
The <eos> token issue
Training strategy
Model architectures
Experiments and results
Applying machine translation scaling law
Applying language modeling scaling law
Correlating scaling law with real translation quality
Scaling strategies
Conclusion
...and 3 more sections

Figures (11)

Figure 1: Test loss of our three smallest models (70M, 160M and 410M) with and without the <eos> prefix.
Figure 2: Test loss of all model checkpoints. Each step represents 512 training samples. Larger models always converge faster given the same amount of training data.
Figure 3: Test losses estimated by power laws fitted on different subsets of models. The laws fitted on all models and on the 70M-160M-410M-1B subset match our observations.
Figure 4: Scaling laws fitted on the general domain and some financial subdomains. The laws are fitted on the English-French direction only.
Figure 5: Scaling law fitted on the general domain for all English-X directions.
...and 6 more figures
