Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation

Tong Zheng, Yan Wen, Huiwen Bao, Junfeng Guo, Heng Huang

TL;DR

This work addresses the Curse of Multilinguality in LLM-based multilingual machine translation by uncovering an asymmetric pattern of linguistic conflicts versus synergy across translation directions during post-training. It introduces Direction-Aware Training (DAT) and group-wise model merging (DATM) to exploit this asymmetry, achieving strong translation quality starting from a relatively lightweight multilingual pretraining on 20B tokens and a compact LoRA-based setup. Key findings show that XX→En directions suffer from conflicts that DAT mitigates, while En→XX directions benefit from synergy that is preserved by merging only in the XX→En direction, resulting in substantial efficiency gains with comparable Flores-200 and WMT23 performance. The approach demonstrates that careful, direction-specific post-training can significantly reduce pretraining cost and model size while maintaining high multilingual translation quality, offering a scalable path toward resource-efficient MMT across many languages.

Abstract

The emergence of Large Language Models (LLMs) has advanced multilingual machine translation (MMT), yet the Curse of Multilinguality (CoM) remains a major challenge. Existing work in LLM-based MMT typically mitigates this issue by scaling up the training and computation budget, which raises a critical question: is scaling up the training and computation budget truly necessary for high-quality MMT, or can a deeper understanding of CoM provide a more efficient solution? To explore this problem, we analyze linguistic conflicts and synergy, the underlying mechanism of CoM, during the post-training phase. We identify an asymmetric phenomenon in linguistic conflicts and synergy: the dominance of conflicts and synergy varies across translation directions, leading to sub-optimal adaptation in existing post-training methods. We further find that a significant bottleneck in MMT appears to lie in post-training rather than multilingual pre-training, suggesting the need for more effective adaptation strategies. Building on these insights, we propose a direction-aware training approach, combined with group-wise model merging, to explicitly address the asymmetry in linguistic conflicts and synergy. Leveraging this strategy, our method fine-tunes X-ALMA-13B-Pretrain (trained only with multilingual pre-training), achieving performance comparable to X-ALMA-13B (only SFT) while using only 20B pretraining tokens and 17B parameters (5.5x fewer pretraining tokens and a 1.7x smaller model), with just a 0.85 COMET drop on the Flores-200 test sets of 50 languages.
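
The efficiency ratios quoted in the abstract can be unpacked with a short calculation. The sketch below only multiplies the stated figures (20B tokens, 17B parameters) by the stated ratios (5.5x, 1.7x); the resulting baseline values are implied by that arithmetic and are not numbers reported in the abstract itself. All variable names are illustrative.

```python
# Figures stated in the abstract.
pretrain_tokens_b = 20.0   # pretraining tokens used, in billions
params_b = 17.0            # parameters used, in billions
token_ratio = 5.5          # "5.5x fewer pretraining tokens"
size_ratio = 1.7           # "1.7x smaller model"

# Implied baseline budget (derived here, not quoted from the paper).
implied_baseline_tokens_b = round(pretrain_tokens_b * token_ratio, 1)
implied_baseline_params_b = round(params_b * size_ratio, 1)

print(f"implied baseline tokens: ~{implied_baseline_tokens_b}B")
print(f"implied baseline params: ~{implied_baseline_params_b}B")
```

Under these assumptions the comparison baseline would correspond to roughly 110B pretraining tokens and roughly 29B parameters.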

Paper Structure

This paper contains 30 sections, 7 figures, 26 tables.

Figures (7)

  • Figure 1: The relationship between pre-training cost, model capacity, and translation performance. We evaluate performance on the Flores-200 test sets across 50 languages. The size of each circle denotes model capacity.
  • Figure 2: Performance of different models trained on varying numbers of languages. The dotted line represents the performance of separately trained models, serving as a reference point where no language conflicts or synergies occur. Two key findings emerge: (1) Asymmetry in Linguistic Conflicts and Synergy (panels a–i), highlighting the uneven impact of multilingual training across language pairs; and (2) The Bottleneck of Multilinguality in Post-Training (panels g–i): while multilingual pre-training provides a solid foundation for handling multiple languages, the multilingual training phase can lead to CoM.
  • Figure 3: $\Delta$ COMET-22 between separate training and multilingual training in XX → En translation, grouped by resource level and linguistic features. The magnitude of $\Delta$ COMET-22 denotes the intensity of linguistic conflicts.
  • Figure 4: (a) Separate Training ($N = N_L$): Each translation task is trained independently using different datasets for different language pairs, with distinct LoRA model weights fine-tuned separately; Multilingual Training ($N = 1$): All language pairs are combined to fine-tune a single model with shared LoRA weights; Group Multilingual Training ($N = N_G$): Language pairs are grouped as specified in Tables \ref{tab:languages1}-\ref{tab:languages2}, with an adapter trained for each group. (b) Group-wise model merging: For XX → En translation, separate training is applied to each language pair. For En → XX translation, group training is applied, where different tasks share LoRA weights within language groups.
  • Figure 5: Number of sentences per language pair in X-ALMA-Parallel-Data (xu2024x).
  • ...and 2 more figures
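
The direction-dependent setup described in the Figure 4 caption can be pictured as a mapping from translation directions to LoRA adapters: one adapter per language pair for XX → En, and one adapter per language group for En → XX. The following is a minimal sketch of that mapping only, not the authors' implementation; the function name, the adapter naming scheme, and the three-language grouping are all hypothetical, and the actual training and weight-merging steps are not shown.

```python
def direction_aware_adapters(languages, group_of):
    """Return {(src, tgt): adapter_name} for every direction to/from English."""
    adapters = {}
    for lang in languages:
        # XX -> En: separate training, one LoRA adapter per language pair.
        adapters[(lang, "en")] = f"separate/{lang}-en"
        # En -> XX: group training, LoRA weights shared within a language group.
        adapters[("en", lang)] = f"group/en-{group_of[lang]}"
    return adapters


# Hypothetical grouping for illustration only (not taken from the paper's tables).
group_of = {"de": "germanic", "nl": "germanic", "es": "romance"}
adapters = direction_aware_adapters(list(group_of), group_of)

# Count distinct adapters per direction.
n_xx_en = len({name for (src, tgt), name in adapters.items() if tgt == "en"})
n_en_xx = len({name for (src, tgt), name in adapters.items() if src == "en"})
print(n_xx_en, n_en_xx)  # 3 adapters for XX -> En, 2 for En -> XX
```

In this toy example, XX → En uses as many adapters as there are languages ($N = N_L$), while En → XX uses as many adapters as there are groups ($N = N_G$).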