BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, Yang Feng

TL;DR

BayLing 2 addresses the way English-centric data limits the multilingual reach of LLMs. It proposes a pivot-language-based alignment approach that uses Chinese and English as pivots, augmented with cross-lingual instruction tuning, to transfer capabilities to 100+ languages. The method fine-tunes Llama-based backbones on 3.2 million instructions, achieving strong translation and knowledge transfer across Flores-101 and WMT22, with substantial gains on low-resource benchmarks. Ablations show that cross-lingual instructions are essential to prevent forgetting and inter-language conflicts, offering a cost-efficient path to scalable multilingual capability expansion. Overall, the work provides a practical framework for broadening LLM multilingual performance across diverse linguistic communities.

Abstract

Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhancing multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages, and performed instruction tuning on this dataset to facilitate capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-2-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. On multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing performance in low-resource languages. Demo, homepage, code, and models of BayLing are available.
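
To make the idea of cross-lingual instructions concrete, the following is a minimal sketch of how one parallel sentence might be turned into instruction records that tie a language to the two pivot languages. It is only an illustration under stated assumptions: the `Instruction` record, the `make_cross_lingual_instructions` helper, and the prompt template are hypothetical, and the abstract does not specify the actual data format or task templates used for BayLing 2.

```python
from dataclasses import dataclass

# Pivot languages named in the abstract.
PIVOT_LANGUAGES = ("English", "Chinese")


@dataclass
class Instruction:
    """Hypothetical record layout: one prompt and its target response."""
    prompt: str
    response: str


def make_cross_lingual_instructions(lang, text, pivot_texts):
    """Build translation-style instructions between `lang` and each pivot.

    `pivot_texts` maps a pivot language name to the translation of `text`
    in that language. The prompt wording is an assumed template.
    """
    records = []
    for pivot in PIVOT_LANGUAGES:
        pivot_text = pivot_texts[pivot]
        # Direction 1: from the given language into the pivot language.
        records.append(Instruction(
            prompt=f"Translate the following {lang} text into {pivot}:\n{text}",
            response=pivot_text,
        ))
        # Direction 2: from the pivot language back into the given language.
        records.append(Instruction(
            prompt=f"Translate the following {pivot} text into {lang}:\n{pivot_text}",
            response=text,
        ))
    return records


if __name__ == "__main__":
    # Illustrative sentence pair, not taken from the BayLing 2 dataset.
    examples = make_cross_lingual_instructions(
        "Swahili",
        "Habari ya asubuhi",
        {"English": "Good morning", "Chinese": "早上好"},
    )
    for record in examples:
        print(record.prompt, "->", record.response)
```

In this sketch each sentence yields four records (two directions for each of the two pivots), so every language is tied to Chinese and English rather than to every other language, which is the cost argument the abstract makes for pivot-based alignment.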

Paper Structure

This paper contains 15 sections, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Overview of BayLing 2. BayLing 2 is a multilingual LLM with efficient language alignment. BayLing 2 designates Chinese and English, two high-resource languages, as pivot languages and applies cross-lingual tasks to align 100+ languages to these pivot languages, which facilitates the capabilities transfer from high-resource languages to low-resource languages. During inference, BayLing 2 is capable of high-quality interaction across multiple languages.
  • Figure 2: Language distribution of instruction dataset.
  • Figure 3: Distribution of instruction categories, including Chinese, English and cross-lingual instructions.
  • Figure 4: Distribution of the number of tokens in each instruction.
  • Figure 5: Training loss curve of BayLing-2-8B.
  • ...and 6 more figures