Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?

Dawei Zhu; Pinzhen Chen; Miaoran Zhang; Barry Haddow; Xiaoyu Shen; Dietrich Klakow

TL;DR

LLMs display strong translation capability after being fine-tuned on as few as 32 parallel sentences, and fine-tuning on a single translation direction enables translation in multiple directions.

Abstract

Traditionally, success in multilingual machine translation can be attributed to three key factors in training data: large volume, diverse translation directions, and high quality. In the current practice of fine-tuning large language models (LLMs) for translation, we revisit the importance of these factors. We find that LLMs display strong translation capability after being fine-tuned on as few as 32 parallel sentences and that fine-tuning on a single translation direction enables translation in multiple directions. However, the choice of direction is critical: fine-tuning LLMs with only English on the target side can lead to task misinterpretation, which hinders translation into non-English languages. Problems also arise when noisy synthetic data is placed on the target side, especially when the target language is well-represented in LLM pre-training. Yet interestingly, synthesized data in an under-represented language has a less pronounced effect. Our findings suggest that when adapting LLMs to translation, the requirement on data quantity can be eased but careful considerations are still crucial to prevent an LLM from exploiting unintended data biases.

Paper Structure (40 sections, 1 equation, 13 figures, 4 tables)

Introduction
Preliminaries
Supervised fine-tuning
Superficial alignment hypothesis
Experiments and Results
Experimental setup
Training.
Evaluation.
How much SFT data enables LLMs to translate?
Setup.
Results.
Do we need to include all directions?
Setup.
SFT results.
ICL results.
...and 25 more sections

Figures (13)

Figure 1: Performance comparison between instruction-tuned baselines and Llama-2 fine-tuned with different training data sizes. Average COMET (left) and BLEU (right) scores across 11 translation directions are presented. For training data sizes of 1 and 3, ICL is applied, marked with an asterisk "$^*$"; otherwise, we perform SFT. With only 32 training examples for SFT, Llama-2 outperforms general-purpose, instruction-tuned baselines. Base.: instruction-tuned baseline models. See individual performance for the 11 translation directions in Appendix \ref{appendix:sec:model_performance_sample_size_sep}.
Figure 2: Normalized COMET score (as a % of the performance from fine-tuning on an equivalent-sized dataset of all 10 directions) resulting from varying combinations of train and test translation directions. In most cases, Llama-2 fine-tuned on a single translation direction can effectively translate across other directions, achieving performance comparable to models trained on all directions, with a few exceptions when trained on X$\rightarrow$en but tested on en$\rightarrow$X. Performance measured in BLEU score is provided in Appendix \ref{appendix:sec:model_performance_vary_training_directions}.
Figure 3: Average performance (in COMET) across 11 test directions for models trained with varying data sizes and directions. Both factors positively impact performance. +=: training directions added on top of previous directions; two directions are added each time. For example, "+=ru" covers 10 directions: en $\leftrightarrow$ {de, zh, cs, jp, ru}. Performance on individual test directions is provided in Appendix \ref{appendix:sec:model_performance_direction_sep}.
Figure 4: Model performance (in COMET) across 15 translation directions under different training configurations. Training models on unseen languages (en$\leftrightarrow$is, en$\leftrightarrow$ha) results in slight improvements in translating these languages compared to models trained on en$\leftrightarrow$de. The differences in performance when translating between seen languages are minimal across all training configurations. Performance measured in BLEU score is provided in Appendix \ref{appendix:sec:model_performance_unseen_sep}.
Figure 5: Model performance (in COMET) for varying training sizes, directions, and noise types. Top (Bottom): score averaged across all en$\rightarrow$X (X$\rightarrow$en) test directions. Training sizes considered are 32 and 1024. Generally, introducing noise on the target side tends to degrade model performance more, with the extent of impact also depending on the particular language involved. Performance measured in BLEU score is provided in Appendix \ref{appendix:sec:model_performance_noisy_sep}.
...and 8 more figures
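
The normalization described in the Figure 2 caption (a score expressed as a percentage of the score obtained when fine-tuning on an equal-sized dataset covering all 10 directions) can be sketched as below. This is only an illustrative reading of the caption: the function name `normalized_score` and the example numbers are assumptions made here, not code or values from the paper.

```python
def normalized_score(single_direction_score: float, all_directions_score: float) -> float:
    """Express a score as a percentage of the all-directions reference score."""
    if all_directions_score == 0:
        raise ValueError("reference score must be non-zero")
    return 100.0 * single_direction_score / all_directions_score


# Hypothetical COMET scores per test direction, for illustration only
# (these are NOT results reported in the paper).
example_scores = {"en->de": 0.80, "de->en": 0.84}      # model trained on one direction
example_reference = {"en->de": 0.80, "de->en": 0.80}   # model trained on all 10 directions

normalized = {
    direction: normalized_score(example_scores[direction], example_reference[direction])
    for direction in example_scores
}
```

Under this reading, a value near 100 means the single-direction model matches the all-directions model on that test direction, and a value well below 100 flags cases such as the X$\rightarrow$en to en$\rightarrow$X exceptions noted in the caption.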
