Cross-Lingual Optimization for Language Transfer in Large Language Models

Jungseob Lee, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

TL;DR

This work addresses the difficulty of transferring English-centric LLMs to non-English languages when target-language data is scarce. It introduces Cross-Lingual Optimization (CLO), a training paradigm that uses a small amount of English SFT data plus translation-based cross-lingual data to align the output language with the input language while preserving English proficiency. CLO trains only the attention layers and combines a target-language NLL loss with a cross-lingual loss. Across six languages and five models, CLO consistently outperforms standard supervised fine-tuning (SFT) and SFT+DPO in both target-language proficiency and English retention, with notable data-efficiency gains in low-resource languages. The approach relies on translation models to generate the cross-lingual data and on a batch-based loss that explicitly links the input language to the output language, which lets the model use its embedded English knowledge for target-language generation. Limitations include the focus on six languages, potential translation artifacts, and the need to validate CLO with other optimization paradigms and broader language coverage. Nevertheless, CLO offers a practical, data-efficient path for multilingual deployment of English-centric LLMs.
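To make the idea of "a target-language NLL loss combined with a cross-lingual loss" concrete, here is a minimal sketch in plain Python. The summary does not state the formula, so the form below is an assumption: the cross-lingual term is written as a DPO-style preference margin that favors the response whose language matches the input. The names `cross_lingual_term`, `combined_loss`, `lam`, and `beta` are hypothetical and are not taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def cross_lingual_term(logp_match: float, logp_mismatch: float,
                       ref_match: float, ref_mismatch: float,
                       beta: float = 0.1) -> float:
    """ASSUMED DPO-style term (not confirmed by the source).

    logp_match / logp_mismatch: policy log-probabilities of the response
    written in the same / the other language as the prompt.
    ref_match / ref_mismatch: the same quantities under a frozen reference.
    """
    margin = beta * ((logp_match - ref_match) - (logp_mismatch - ref_mismatch))
    return -log_sigmoid(margin)


def combined_loss(target_nll: float, pairs, lam: float = 1.0,
                  beta: float = 0.1) -> float:
    """Target-language NLL plus the batch-averaged cross-lingual term.

    `lam` is a hypothetical weighting coefficient.
    """
    cl = sum(cross_lingual_term(*p, beta=beta) for p in pairs) / len(pairs)
    return target_nll + lam * cl


# One pair where policy and reference agree: the margin is zero,
# so the cross-lingual term equals log(2).
print(combined_loss(2.0, [(-10.0, -10.0, -10.0, -10.0)]))
```

Under this assumed form, raising the log-probability of the language-matched response lowers the loss, which is one way a loss could "link" input and output language as the summary describes.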

Abstract

Adapting large language models to other languages typically employs supervised fine-tuning (SFT) as a standard approach. However, it often suffers from an overemphasis on English performance, a phenomenon that is especially pronounced in data-constrained environments. To overcome these challenges, we propose Cross-Lingual Optimization (CLO), which efficiently transfers an English-centric LLM to a target language while preserving its English capabilities. CLO utilizes publicly available English SFT data and a translation model to enable cross-lingual transfer. We conduct experiments using five models on six languages with varying levels of resource availability. Our results show that CLO consistently outperforms SFT in both acquiring target-language proficiency and maintaining English performance. Remarkably, in low-resource languages, CLO with only 3,200 samples surpasses SFT with 6,400 samples, demonstrating that CLO can achieve better performance with less data. Furthermore, we find that SFT is particularly sensitive to data quantity in medium- and low-resource languages, whereas CLO remains robust. Our comprehensive analysis emphasizes the limitations of SFT and incorporates additional training strategies in CLO to enhance efficiency.

Paper Structure

This paper contains 40 sections, 12 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Example responses to a Swahili query generated by English-centric instruction models, the SFT model, and the proposed CLO model.
  • Figure 2: Overview of cross-lingual dataset preparation and the optimization method. English ($x_{\text{en}}$, $y_{\text{en}}$) pairs are translated into the target language to produce ($x_{\ell}$, $y_{\ell}$) pairs, forming the cross-lingual dataset. Optimization is then performed using the combined loss $\mathcal{L}_{\text{CLO}}$.
  • Figure 3: Comparison of win rates between CLO and SFT Llama-2-7B models trained with varying amounts of data, each evaluated on AlpacaEval against an SFT model trained with 6,400 pair examples. The "SFT Assumed" baseline is assigned a win rate of 50%, as it compares identical models.
  • Figure 4: Comparison of average MMMLU performance by category for CLO and SFT models of Llama-2 and Llama-3 in Chinese, Korean, and Swahili languages.
  • Figure 5: Comparison of win rates between CLO and SFT Llama-3 models, trained with varying amounts of data, against a model fine-tuned using the SFT method with 6,400 pair examples on the AlpacaEval dataset. The "SFT Assumed" baseline is assigned a win rate of 50% since it compares the same model and represents the ideal performance of an SFT model trained with fewer than 6,400 pairs.
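
The dataset preparation described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: `translate` stands in for any translation model, and the record layout (`prompt`, `matched`, `mismatched`) is a hypothetical schema in which the "matched" response is the one written in the same language as the prompt.

```python
from typing import Callable, Dict, List, Tuple


def build_cross_lingual_records(
    en_pairs: List[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Turn English (x_en, y_en) pairs into cross-lingual records.

    Each English pair is translated into (x_l, y_l). Two records are
    emitted per pair (an ASSUMED layout): one with the English prompt,
    one with the target-language prompt.
    """
    records = []
    for x_en, y_en in en_pairs:
        x_l, y_l = translate(x_en), translate(y_en)
        # English prompt: the English response matches the input language.
        records.append({"prompt": x_en, "matched": y_en, "mismatched": y_l})
        # Target-language prompt: the translated response matches.
        records.append({"prompt": x_l, "matched": y_l, "mismatched": y_en})
    return records


def toy_translate(text: str) -> str:
    """Placeholder translator for demonstration only; a real system
    would call a machine translation model here."""
    return "[sw] " + text


demo = build_cross_lingual_records(
    [("What is the capital of Kenya?", "Nairobi.")], toy_translate
)
for record in demo:
    print(record)
```

Keeping both directions in the same batch is one way to expose the model to the input-language and output-language correspondence in both English and the target language. Whether the original method batches records exactly this way is not stated in the summary.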