A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Peiqin Lin, André F. T. Martins, Hinrich Schütze

TL;DR

This work provides a practical recipe for exploiting parallel corpora to enhance multilingual large language models (mLLMs). It systematically evaluates four factors (data quality, data quantity, training objective, and model size) across diverse languages and tasks. It shows that filtering noisy translations is the data-quality step that matters most, that about 10K high-quality parallel sentences can yield results comparable to those from much larger datasets, and that training with the machine translation (MT) objective alone delivers the strongest gains among the objectives compared. Larger mLLMs benefit more from parallel corpora than smaller ones. The findings offer actionable guidance for data curation and training strategies, extending previous insights beyond a narrow set of languages and tasks.

Abstract

Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus with just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.
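As a rough illustration of findings (i) to (iii), the sketch below filters parallel sentence pairs by a precomputed translation-quality score, subsamples the survivors to a small budget, and formats each pair as a machine translation training example. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `ScoredPair` type, the 0.75 threshold, the function names, and the prompt template are all hypothetical, and the quality scores are assumed to come from an external quality-estimation metric.

```python
import random
from typing import NamedTuple


class ScoredPair(NamedTuple):
    source: str
    target: str
    quality: float  # assumed: precomputed quality-estimation score in [0, 1]


def select_parallel_data(pairs, min_quality=0.75, max_size=10_000, seed=0):
    """Drop low-quality pairs, then subsample to at most max_size pairs.

    The threshold and budget are illustrative defaults, not values
    confirmed by the paper.
    """
    kept = [p for p in pairs if p.quality >= min_quality]
    if len(kept) > max_size:
        # Seeded sampling keeps the selection reproducible.
        kept = random.Random(seed).sample(kept, max_size)
    return kept


def to_mt_example(pair, src_lang, tgt_lang):
    """Format one pair as a prompt/completion example (hypothetical template)."""
    prompt = (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"{src_lang}: {pair.source}\n"
        f"{tgt_lang}:"
    )
    return {"prompt": prompt, "completion": " " + pair.target}


if __name__ == "__main__":
    corpus = [
        ScoredPair("Good morning.", "Guten Morgen.", 0.91),
        ScoredPair("Click here", "asdf qwer", 0.12),
        ScoredPair("Thank you.", "Danke.", 0.88),
    ]
    selected = select_parallel_data(corpus, min_quality=0.75, max_size=10_000)
    for pair in selected:
        print(to_mt_example(pair, "English", "German"))
```

Keeping filtering and subsampling in one function makes it easy to vary the quality threshold and the data budget independently, which mirrors the quality and quantity factors the study examines.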

Paper Structure (27 sections, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Average performance improvements (y-axis) achieved by mLLMs enhanced with parallel corpora compared to their base models. Best: Instruction tuning of BLOOM-7B1 with the machine translation objective (MT) using 10K high-quality (i.e., filtered) parallel sentences yields the best results. Main variations explored include: Filter (No) (using the original data); OBJ (TLM) (translation language modeling objective); OBJ (XSS) (cross-lingual semantic similarity objective); |Data| (50K) (a larger 50K-sentence dataset); |Model| (1B7) (BLOOM-1B7 model).
  • Figure 2: Translation quality, measured by COMETKiwi, of 500K parallel sentences from OPUS100 for our five language pairs. The COMETKiwi scores are segmented into four ranges: 0-0.25, 0.25-0.5, 0.5-0.75, and 0.75-1. Higher scores represent better translation quality.
  • Figure 3: Sentence length of 500K parallel sentences from OPUS100 for our five language pairs. The three categories are 0-5, 5-10, and greater than 10 tokens.
  • Figure 4: Percentage of sentences retained after language identification filtering of 500K parallel sentences from OPUS100 for our five language pairs.
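
The corpus statistics summarized in Figures 2 and 3 can be approximated with a small bucketing routine such as the one below. This is a sketch under assumptions, not the authors' analysis code: the function names are hypothetical, scores and token counts are assumed to be precomputed, and the handling of values that fall exactly on a range boundary is a choice made here because the captions do not specify it.

```python
from collections import Counter


def score_bucket(score):
    """Map a quality score in [0, 1] to one of the four ranges in Figure 2.

    Assumption: each boundary value belongs to the higher range.
    """
    if score < 0.25:
        return "0-0.25"
    if score < 0.5:
        return "0.25-0.5"
    if score < 0.75:
        return "0.5-0.75"
    return "0.75-1"


def length_bucket(n_tokens):
    """Map a token count to one of the three categories in Figure 3.

    Assumption: counts of exactly 5 and 10 fall in the lower category.
    """
    if n_tokens <= 5:
        return "0-5"
    if n_tokens <= 10:
        return "5-10"
    return ">10"


def bucket_distribution(values, bucket_fn):
    """Return the percentage of values falling in each bucket."""
    if not values:
        return {}
    counts = Counter(bucket_fn(v) for v in values)
    return {bucket: 100.0 * n / len(values) for bucket, n in counts.items()}


if __name__ == "__main__":
    scores = [0.10, 0.90, 0.80, 0.60]
    lengths = [3, 7, 12, 25]
    print(bucket_distribution(scores, score_bucket))
    print(bucket_distribution(lengths, length_bucket))
```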