From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun

TL;DR

This work tackles the limited cross-lingual efficacy of unaligned multilingual data by introducing TED2025, a large-scale, high-quality multi-way parallel corpus spanning 113 languages with up to 50-way alignment. Through systematic experiments on continued pretraining and instruction tuning, the authors demonstrate that multi-way parallel data yields consistent gains over unaligned data across six multilingual benchmarks, improving downstream tasks, zero-shot transfer, and internal representation alignment. The study also analyzes factors shaping these gains, including parallelism degree, English pivot effects, and language-family diversity, and finds that MT-oriented instruction tuning offers robust cross-lingual benefits while highlighting domain transfer limitations. Overall, TED2025 provides a scalable data foundation for advancing multilingual LLMs, with practical guidance on data selection, training strategies, and task objectives, though future work is needed to scale data further and explore alternative PEFT methods.

Abstract

Continued pretraining and instruction tuning on large-scale multilingual data have proven effective for scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
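To make "multi-way parallel" concrete, the following is a minimal sketch of how sentences sharing the same content can be grouped across languages and how the degree of parallelism of each group can be counted. The record layout `(segment_id, lang, text)`, the segment identifiers, and the helper names are assumptions for illustration only; they are not the TED2025 file format or the authors' pipeline.

```python
from collections import defaultdict


def group_multiway(records):
    """Group (segment_id, lang, text) triples into multi-way aligned entries.

    Hypothetical record layout: every translation of the same source
    segment carries the same segment_id.
    """
    groups = defaultdict(dict)
    for segment_id, lang, text in records:
        groups[segment_id][lang] = text
    return dict(groups)


def parallelism_degree(entry):
    """Number of languages in which one aligned segment is available."""
    return len(entry)


def filter_by_degree(groups, min_degree):
    """Keep only segments aligned across at least min_degree languages."""
    return {sid: e for sid, e in groups.items() if len(e) >= min_degree}


# Toy data (invented for this example, not taken from the corpus).
records = [
    ("talk1#0", "en", "Thank you."),
    ("talk1#0", "de", "Danke."),
    ("talk1#0", "fr", "Merci."),
    ("talk1#1", "en", "Good morning."),
    ("talk1#1", "de", "Guten Morgen."),
]

groups = group_multiway(records)
degrees = {sid: parallelism_degree(e) for sid, e in groups.items()}
three_way = filter_by_degree(groups, 3)
```

Here the first segment is 3-way parallel and the second is 2-way parallel, so a threshold of 3 keeps only the first. An unaligned corpus, by contrast, would have no shared `segment_id` linking sentences across languages.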

Paper Structure

This paper contains 31 sections, 10 figures, 17 tables.

Figures (10)

  • Figure 1: Distribution of sentence counts (line chart, left y-axis, log scale) and parallelism spans (bar chart, right y-axis, ratio) across languages (x-axis) in the TED2025 corpus. Parallelism spans are concentrated between 21 and 30 languages and remain high even for low-resource languages.
  • Figure 2: Comparison of translation quality, measured by COMET-QE score, between TED2025 and existing multi-way datasets: TED2018 (Qi et al., 2018), TED2020 (Reimers and Gurevych, 2020), and MWccMatrix (Thompson et al., 2024).
  • Figure 3: Cross-lingual transfer performance comparison between Baseline, Unaligned and Multi-Way pretraining on the FLORES-200 benchmark with BLEU (bar chart, left y-axis) and COMET (line chart, right y-axis) for LLaMA-3-8B and Qwen-2.5-14B models.
  • Figure 4: SVCCA alignment comparison between the Multi-Way, Unaligned and Baseline models across 32-way language pairs.
  • Figure 5: Performance (%) of continued pretraining models on downstream tasks with varying degrees of parallelism.
  • ...and 5 more figures