Continual Learning Under Language Shift

Evangelia Gogoulou; Timothée Lesort; Magnus Boman; Joakim Nivre

TL;DR

This paper studies the pros and cons of updating a language model when new data come from new languages, that is, continual learning under language shift. A combination of language contamination and syntactic similarity best fits the results.

Abstract

The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. We study the pros and cons of updating a language model when new data come from new languages, that is, the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic, and Norwegian to investigate how forward and backward transfer effects depend on pre-training order and characteristics of languages, for three different model sizes. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.

Paper Structure (15 sections, 3 figures, 5 tables)

This paper contains 15 sections, 3 figures, 5 tables.

Introduction
Related Work
Method
Learning Scenario and Setup
Model Architecture and Training
Datasets
Language Similarity Metrics
Language Contamination
Results
Forward and Backward Transfer
Model Size
Discussion
Conclusion
Acknowledgements
Experimental Details

Figures (3)

Figure 1: Test loss on Danish, Icelandic, and Norwegian when learned at different stages. Clear improvement in the loss is observed when the language is learned later in the sequence, except for the $126$M model trained on Icelandic.
Figure 2: Test loss on English (Left) and on Danish, Icelandic, and Norwegian (Right) for models with varying size and language suffixes. Suffix length refers to the number of languages added after the one visualised. Overall, Icelandic always causes forgetting of the other languages, while positive (or weaker negative) transfer is observed between Danish and Norwegian, and from those two languages to English.
Figure 3: Left: Cumulative loss over all language test sets, averaged per model size at the final stage. Growing the model size from $126$M to $356$M and then to $1.3$B reduces the test loss by $8$% and $11$% on average, respectively. Right: Test loss in the current language vs. forgetting (i.e., loss growth on previous languages). For a given language and stage (color and shape), increasing the size of the model consistently decreases the current loss while forgetting remains mostly the same.
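
The caption of Figure 3 describes forgetting as loss growth on previous languages. A minimal Python sketch of one way such a quantity could be computed is given here. The function name `forgetting_per_stage`, the data layout, and the choice to average the stage-to-stage loss difference over earlier languages are assumptions made for illustration and are not taken from the paper. The loss values are placeholders, not reported results.

```python
from typing import Dict, List


def forgetting_per_stage(
    losses: List[Dict[str, float]], order: List[str]
) -> List[float]:
    """Mean loss growth on previously learned languages at each stage.

    losses[t][lang] is the test loss on `lang` after training stage t,
    and order[t] is the language learned at stage t. Both the data layout
    and the stage-to-stage averaging are assumptions of this sketch.
    """
    result = []
    for t in range(1, len(order)):
        previous = order[:t]
        # Positive growth means the loss on an earlier language got worse.
        growth = [losses[t][lang] - losses[t - 1][lang] for lang in previous]
        result.append(sum(growth) / len(growth))
    return result


# Illustrative placeholder losses, not results from the paper.
example_order = ["en", "da", "is"]
example_losses = [
    {"en": 3.0},
    {"en": 3.25, "da": 2.5},
    {"en": 3.125, "da": 3.0, "is": 2.0},
]
print(forgetting_per_stage(example_losses, example_order))
```

Under this convention, a positive value at a stage indicates forgetting of earlier languages, while a negative value would correspond to positive backward transfer.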
