Bilingual Adaptation of Monolingual Foundation Models
Gurpreet Gosal, Yishi Xu, Gokul Ramakrishnan, Rituraj Joshi, Avraham Sheinin, Zhiming Chen, Biswajit Mishra, Natalia Vassilieva, Joel Hestness, Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Onkar Pandit, Satheesh Katipomu, Samta Kamboj, Samujjwal Ghosh, Rahul Pal, Parvez Mullah, Soundar Doraiswamy, Mohamed El Karim Chami, Preslav Nakov
TL;DR
The paper tackles adapting English-dominant foundation models to Arabic (and Hindi) without catastrophic forgetting. It introduces a two-stage recipe: vocabulary extension with embedding alignment and embedding-only pre-training, followed by bilingual continual pre-training of the full model on mixed data. Through ablations over embedding initialization, data mix ratios, and learning rates, the authors report significant Arabic gains on Llama 2 together with slight improvements in English. To test generalizability, they also adapt Llama 3 8B to Arabic and Llama 2 13B to Hindi. Alternative strategies such as block expansion adapters are explored to reduce training costs. The work provides a practical training recipe for cross-lingual adaptation of monolingual foundation models, with implications for expanding capabilities in low-resource languages.
Abstract
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing the challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embedding matrix, followed by continual pre-training of the full model on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates, and release a detailed training recipe. To demonstrate the generalizability of this approach, we also adapted Llama 3 8B to Arabic and Llama 2 13B to Hindi.
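The two stages described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. Several pieces are assumptions made for illustration: initializing each new token as the mean of the embeddings of its base-tokenizer pieces (the abstract only says initialization techniques were ablated), the parameter-name markers "embed_tokens" and "lm_head", the decision to include the output projection in stage 1, and the arabic_fraction parameter, whose value is not given here.

```python
import random
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def extend_embeddings(
    vocab: Dict[str, int],
    embeddings: List[Vector],
    new_tokens: Sequence[str],
    base_tokenize: Callable[[str], List[str]],
) -> None:
    """Append one embedding row per new token, modifying both arguments in place.

    Assumed scheme: a new row is the mean of the rows of the pieces that the
    base tokenizer splits the new token into.
    """
    dim = len(embeddings[0])
    for token in new_tokens:
        if token in vocab:
            continue  # already covered by the base vocabulary
        piece_ids = [vocab[p] for p in base_tokenize(token) if p in vocab]
        # Fall back to the mean of all existing rows if no piece is known.
        source_ids = piece_ids or list(range(len(embeddings)))
        row = [
            sum(embeddings[i][d] for i in source_ids) / len(source_ids)
            for d in range(dim)
        ]
        vocab[token] = len(embeddings)
        embeddings.append(row)


def trainable_parameters(
    param_names: Sequence[str],
    stage: int,
    embedding_markers: Sequence[str] = ("embed_tokens", "lm_head"),
) -> List[str]:
    """Stage 1 updates only embedding-related parameters; stage 2 updates all.

    The marker strings are hypothetical names. Treating the output projection
    ("lm_head") as part of stage 1 is an assumption of this sketch.
    """
    if stage == 1:
        return [n for n in param_names if any(m in n for m in embedding_markers)]
    return list(param_names)


def sample_mixed_batch(
    arabic_docs: Sequence[str],
    english_docs: Sequence[str],
    arabic_fraction: float,
    batch_size: int,
    rng: random.Random,
) -> List[str]:
    """Draw a bilingual batch; each item is Arabic with probability arabic_fraction."""
    return [
        rng.choice(arabic_docs) if rng.random() < arabic_fraction
        else rng.choice(english_docs)
        for _ in range(batch_size)
    ]


# Toy usage: a character-level "tokenizer" and 2-dimensional embeddings.
vocab = {"a": 0, "b": 1, "c": 2}
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
extend_embeddings(vocab, embeddings, ["ab"], base_tokenize=list)
print(vocab["ab"], embeddings[3])  # 3 [0.5, 0.5]

names = ["model.embed_tokens.weight", "model.layers.0.mlp.weight", "lm_head.weight"]
print(trainable_parameters(names, stage=1))
```

Initializing new rows from existing ones keeps the extended model close to the base model at the start of stage 1, which is one plausible reason to train embeddings alone before unfreezing the rest. In a real training stack, the list returned by trainable_parameters would decide which tensors receive gradient updates.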
