SambaLingo: Teaching Large Language Models New Languages

Zoltan Csaki; Bo Li; Jonathan Li; Qiantong Xu; Pian Pawakapan; Leon Zhang; Yun Du; Hengyu Zhao; Changran Hu; Urmish Thakker

TL;DR

SambaLingo presents a practical protocol for adapting English-centric LLMs to new languages through vocabulary extension, continual pre-training, and human-preference alignment. The method is validated on 9 typologically diverse languages at two model scales (7B and 70B) and outperforms all prior published baselines; the evaluation code and checkpoints are released publicly. Key contributions include guidance on vocabulary expansion, embedding initialization, DPO data mixtures, and the importance of base-model quality. The approach performs strongly across languages, its gains carry over to the 70B scale, and its evaluation includes GPT-4 and Claude Opus judgments. The work aims to extend language model capabilities beyond a handful of languages.

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

Paper Structure (33 sections, 14 figures, 9 tables)

Introduction
Related Work
Adaptation Methodology
Selecting a Base Model
Extending Model Vocabulary
Continual Pre-training
Aligning To Human Preferences In Other Languages
Evaluation
Quantitative Evaluation
Quantitative Results
Scaling to 70B
Evaluating Human Aligned Checkpoints
GPT-4 as a Judge
Qualitative Results
Ablations
...and 18 more sections

Figures (14)

Figure 1: Evaluation perplexity on a held-out dataset; perplexity over Wikipedia and mC4 is also evaluated in the appendix. Open-source expert baselines: Japanese: Swallow-7b-hf [swallow]; Thai: typhoon-7b [pipatanakul2023typhoon]; Arabic: jais-13b [sengupta2023jais]; Hungarian: PULI-GPTrio [yang-puli-gptrio]; Russian: saiga-7b [IlyaGusevsaiga]; Bulgarian: mGPT-bulgarian [shliazhko2023mgpt]. We could not find Serbian, Slovenian, or Turkish language models with perplexity low enough to fit the graph, so we omit them here for readability.
Figure 2: Quantitative evaluation results. The "best open source experts" are the same as the ones specified in Figure 1. See the appendix for the full breakdown.
Figure 3: GPT-4 evaluation results
Figure 4: Training loss for different token initialization methods
Figure 5: Tokenizer fertility: the average number of tokens per "word" [acs2019]
...and 9 more figures
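
The fertility metric named in the Figure 5 caption (average number of tokens per "word") can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes words are whitespace-delimited and that each word is tokenized in isolation, which is a simplification: languages written without spaces, such as Japanese or Thai, need a separate word segmenter, and real subword tokenizers may split a word differently in context. The names `fertility` and `char_bigram_tokenize` are hypothetical, and the bigram tokenizer is a toy stand-in for a real one.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word.

    Assumption: words are split on whitespace and tokenized one at a time.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        for word in text.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def char_bigram_tokenize(word: str) -> List[str]:
    """Toy tokenizer (hypothetical): split a word into two-character chunks."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


if __name__ == "__main__":
    sample = ["ab cdef"]
    # "ab" yields 1 token and "cdef" yields 2 tokens, so 3 tokens over 2 words.
    print(fertility(sample, char_bigram_tokenize))
```

A lower value means the tokenizer represents the language with fewer tokens per word. To measure a real tokenizer, pass its tokenization function in place of the toy one.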
