Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation
Samin Mahdizadeh Sani, Pouya Sadeghi, Thuy-Trang Vu, Yadollah Yaghoobzadeh, Gholamreza Haffari
TL;DR
This work addresses the challenge of extending a predominantly English LLM (Llama-2) to Persian via parameter-efficient fine-tuning. It combines vocabulary expansion, monolingual and bilingual pretraining, and instruction tuning, with LoRA-based adaptation and careful alignment of Persian and English representations. The study finds that bilingual alignment improves Persian classification, that English-to-Persian knowledge transfer is only modestly beneficial when training data is limited, and that additional Persian pretraining improves generation. The results suggest that Llama-2 may not be the optimal base for Persian, since models such as Gemma and Qwen show stronger performance. The work offers a practical blueprint for extending LLMs to low-resource languages through targeted fine-tuning and cross-lingual alignment, balancing alignment quality, data availability, and task type.
Abstract
Large language models (LLMs) have made great progress in classification and text generation tasks. However, they are mainly trained on English data and often struggle with low-resource languages. In this study, we explore adding a new language, i.e., Persian, to Llama (a model with a limited understanding of Persian) using parameter-efficient fine-tuning. We employ a multi-stage approach involving pretraining on monolingual Persian data, aligning representations through bilingual pretraining and instruction datasets, and instruction-tuning with task-specific datasets. We evaluate the model's performance at each stage on generation and classification tasks. Our findings suggest that incorporating the Persian language, through bilingual data alignment, can enhance classification accuracy for Persian tasks, with no adverse impact and sometimes even improvements on English tasks. Additionally, the results highlight the model's initial strength as a critical factor when working with limited training data, with cross-lingual alignment offering minimal benefits for the low-resource language. Knowledge transfer from English to Persian has a marginal effect, primarily benefiting simple classification tasks.
