Efficiently Adapting Pretrained Language Models To New Languages

Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu

TL;DR

The paper addresses the challenge of adapting pretrained language models to low-resource languages by focusing on tokenizer efficiency and catastrophic forgetting. It proposes a practical method: replace a portion of the base tokenizer's tokens with new-language tokens and train with mixed-language data during pretraining and instruction tuning. Experiments adapting an English-centric GPT-2 model to Hungarian and Thai show improved target-language performance with minimal English regression, often surpassing open-source baselines, and ablations highlight the importance of tokenizer choice and data mixing. The work provides actionable guidance for efficient cross-lingual adaptation, including token replacement thresholds and the value of small amounts of target-language instruction tuning data.

Abstract

Recent large language models (LLMs) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high-quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross-lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on adapting an English LLM to Hungarian and Thai show that our recipe can reach better performance than open-source models on the target language, with minimal regressions on English.
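
The data mixing idea mentioned in the abstract can be illustrated with a minimal sketch. The function name `mix_documents`, the `target_fraction` parameter, and the per-document sampling scheme are assumptions made for illustration; the paper's actual mixing ratios and sampling procedure are not given in this summary.

```python
import random


def mix_documents(english_docs, target_docs, target_fraction, num_samples, seed=0):
    """Sample a training stream that mixes English and target-language documents.

    Hypothetical helper: each sample is drawn from the target-language pool
    with probability `target_fraction`, otherwise from the English pool.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be between 0 and 1")
    rng = random.Random(seed)
    mixed = []
    for _ in range(num_samples):
        # random() returns a value in [0, 1), so 0.0 never picks the target
        # pool and 1.0 always does.
        pool = target_docs if rng.random() < target_fraction else english_docs
        mixed.append(rng.choice(pool))
    return mixed


if __name__ == "__main__":
    english = ["The cat sat on the mat.", "Language models are large."]
    hungarian = ["A macska a szőnyegen ült.", "A nyelvi modellek nagyok."]
    for doc in mix_documents(english, hungarian, target_fraction=0.75, num_samples=8):
        print(doc)
```

Keeping some English data in the mixture is the lever the abstract associates with mitigating catastrophic forgetting; the 0.75 value above is an arbitrary placeholder, not a recommendation from the paper.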

Paper Structure

This paper contains 32 sections, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Fertility scores of the bilingual tokenizers (left: Hungarian, right: Thai) for different numbers of tokens replaced with new-language tokens. The red lines represent the original GPT-2 tokenizer (50k vocabulary), while the green lines represent a tokenizer trained purely on the new language. Every tokenizer has the same total vocabulary size. The number of replaced tokens is given in units of $10^3$.
  • Figure 2: Varying pretraining data mixtures. "EN" and "HU" models are monolingual models trained from scratch, while the other models are trained from the "EN" model with the labeled data mixture.
  • Figure 3: Model performance with different instruction tuning (IT) data mixtures. The ROUGE-2 score is reported for HU Sum, while accuracy and F1 scores are reported for the remaining tasks.
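
The fertility score plotted in Figure 1 is commonly defined as the average number of tokens a tokenizer produces per word, so lower values indicate more efficient encoding. The sketch below assumes that common definition and whitespace word splitting; both are assumptions, since this summary does not state the paper's exact procedure. Whitespace splitting in particular does not apply to Thai, which is written without spaces between words and would need a word segmenter. The `fertility` function and the two toy tokenizers are hypothetical names used only for illustration.

```python
def fertility(tokenize, texts):
    """Average number of tokens per whitespace-separated word.

    `tokenize` is any callable mapping a string to a list of tokens.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        for word in text.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words found in texts")
    return total_tokens / total_words


def char_tokenizer(word):
    # Worst case for fertility: one token per character.
    return list(word)


def whole_word_tokenizer(word):
    # Best case for fertility: every word is a single token.
    return [word]


if __name__ == "__main__":
    sample = ["szia világ"]
    print(fertility(char_tokenizer, sample))        # 4.5
    print(fertility(whole_word_tokenizer, sample))  # 1.0
```

A real comparison would pass the encode function of each candidate tokenizer as `tokenize` and evaluate on held-out text in the target language.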