From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun
TL;DR
This work tackles the limited cross-lingual efficacy of unaligned multilingual data by introducing TED2025, a large-scale, high-quality multi-way parallel corpus spanning 113 languages with up to 50-way alignment. Through systematic experiments on continued pretraining and instruction tuning, the authors demonstrate that multi-way parallel data yields consistent gains over unaligned data across six multilingual benchmarks, improving downstream tasks, zero-shot transfer, and internal representation alignment. The study also analyzes factors shaping these gains, including parallelism degree, English pivot effects, and language-family diversity, and finds that MT-oriented instruction tuning offers robust cross-lingual benefits while highlighting domain transfer limitations. Overall, TED2025 provides a scalable data foundation for advancing multilingual LLMs, with practical guidance on data selection, training strategies, and task objectives, though future work is needed to scale data further and explore alternative PEFT methods.
Abstract
Continued pretraining and instruction tuning on large-scale multilingual data have proven effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce TED2025, a large-scale, high-quality multi-way parallel corpus based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining and instruction tuning, along with an analysis of the key factors that influence the gains. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
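To make the notion of multi-way parallel data concrete, the following is a minimal sketch of how segments that share the same source content could be grouped across languages. The record layout `(talk_id, seg_idx, lang, text)`, the function name `build_multiway`, and the `min_langs` threshold are all assumptions made for illustration; they are not the TED2025 format or the authors' pipeline.

```python
from collections import defaultdict


def build_multiway(records, min_langs=2):
    """Group segments by (talk_id, seg_idx) so each entry maps lang -> text.

    `records` is an iterable of (talk_id, seg_idx, lang, text) tuples.
    This schema is hypothetical and chosen only for illustration.
    Entries covered by fewer than `min_langs` languages are dropped.
    """
    groups = defaultdict(dict)
    for talk_id, seg_idx, lang, text in records:
        groups[(talk_id, seg_idx)][lang] = text
    return {key: langs for key, langs in groups.items() if len(langs) >= min_langs}


# Toy input: one segment available in three languages, one in a single
# language, and one in two languages.
records = [
    ("talk1", 0, "en", "Hello, everyone."),
    ("talk1", 0, "de", "Hallo zusammen."),
    ("talk1", 0, "zh", "大家好。"),
    ("talk1", 1, "en", "Thank you."),
    ("talk2", 0, "en", "Good morning."),
    ("talk2", 0, "fr", "Bonjour."),
]

aligned = build_multiway(records, min_langs=2)

# The number of languages in an entry is its degree of parallelism.
degrees = {key: len(langs) for key, langs in aligned.items()}
```

In this toy example the first segment of `talk1` is 3-way parallel, the second is dropped because only English is available, and the segment from `talk2` is 2-way parallel. Unaligned multilingual data, by contrast, would have no shared key linking texts across languages.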
