A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models

Haoran Xu; Young Jin Kim; Amr Sharaf; Hany Hassan Awadalla

TL;DR

This work addresses the translation gap of decoder-only large language models with 7B to 13B parameters by introducing ALMA, a two-stage fine-tuning recipe: the model is first fine-tuned on monolingual data and then on a small set of high-quality parallel data. The approach reduces reliance on massive parallel corpora while improving on LLaMA-2's zero-shot performance by more than 12 BLEU and 12 COMET on average across 10 translation directions from WMT'21 and WMT'22. It surpasses prior decoder-only efforts and, on average, outperforms NLLB-54B and GPT-3.5-text-davinci-003. Key findings are that a modest amount of monolingual data (around 1B tokens) combined with high-quality parallel data yields strong MT performance, and that data quality often matters more than quantity, with notable improvements in non-English translation and cross-lingual capabilities. The results position ALMA as a scalable, efficient training paradigm for high-quality translation with moderate-size LLMs, with practical implications for multilingual MT in resource-constrained settings.

Abstract

Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially for models of moderate size (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models. Previous studies have attempted to improve the translation capabilities of these moderate-size LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data, followed by fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced Language Model-based trAnslator (ALMA). With LLaMA-2 as the underlying model, our results show that ALMA achieves an average improvement of more than 12 BLEU and 12 COMET over zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. With only 7B or 13B parameters, its performance is significantly better than all prior work and even superior to the NLLB-54B model and GPT-3.5-text-davinci-003. This method establishes the foundation for a novel training paradigm in machine translation.
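
The two-stage ordering described in the abstract can be sketched as a small schedule. This is only an illustration of the stage order (monolingual data first, then a small parallel set), not the authors' training code: the names Stage, ALMA_STYLE_STAGES, run_recipe, and the fine_tune callback are hypothetical, and the actual optimizer, data mix, and hyperparameters are not specified here.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One fine-tuning stage (hypothetical container, not from the paper)."""
    name: str
    data_kind: str  # "monolingual" or "parallel"


# Order matters: monolingual fine-tuning comes before parallel fine-tuning.
ALMA_STYLE_STAGES: List[Stage] = [
    Stage(name="stage-1", data_kind="monolingual"),
    Stage(name="stage-2", data_kind="parallel"),
]


def run_recipe(
    model: Any,
    stages: List[Stage],
    fine_tune: Callable[[Any, Stage], Any],
) -> Tuple[Any, List[str]]:
    """Apply each stage in order, feeding the output model into the next stage.

    `fine_tune` is a caller-supplied function (an assumption of this sketch)
    that performs the actual training for one stage and returns the new model.
    """
    history: List[str] = []
    for stage in stages:
        model = fine_tune(model, stage)
        history.append(stage.name)
    return model, history
```

A stand-in fine_tune such as `lambda m, s: m + "+" + s.data_kind`, applied to a string placeholder for the model, is enough to check that the stages run in the intended order.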

Paper Structure (35 sections, 2 equations, 6 figures, 12 tables)

Introduction
Preliminary
Task Definition
A Backbone LLM for Translation
Do LLMs Have an Appetite for Parallel Data?
Experimental Design
Observations
A New Training Recipe
Experiments
Data
Training Setup
Baselines
Results
Analyses
How Much Monolingual Data to Use?
...and 20 more sections

Figures (6)

Figure 1: Translation performance of contemporary decoder-only LLM translation systems based on LLaMA (BigTranslate, BayLing), and zero-shot performance of LLaMA, for the WMT'22 test data across 8 directions (translating to or from English for German, Czech, Chinese, and Russian). Benchmark comparisons also include two leading translation models, NLLB-54B and GPT-3.5-text-davinci-003. Our systems, developed on LLaMA-2 with 7B and 13B parameters, surpass previous models by an impressive margin of nearly 10 BLEU and 7 COMET. Furthermore, they even slightly outperform GPT-3.5 and NLLB-54B on average.
Figure 2: The prompt used for training and evaluation. [source language] and [target language] represent the full name of the language, e.g., Translate this from German to English. Note that we do not compute loss for the prompt.
Figure 3: Averaged zero-shot translation performance on 10 directions: cs$\leftrightarrow$en, de$\leftrightarrow$en, is$\leftrightarrow$en, zh$\leftrightarrow$en, ru$\leftrightarrow$en, where is$\leftrightarrow$en is from WMT'21 test data and the others from WMT'22 test data.
Figure 4: BLEU and COMET scores obtained during the fine-tuning of MPT-7B and LLaMA-2-7B across each data step for en$\rightarrow$ru. Additionally, we present the results for NLLB-54B and a 7B model trained from scratch. A notable decline in LLaMA-2-7B's COMET score suggests that substantial parallel data might dilute its pre-existing knowledge.
Figure 5: The average performance of ALMA-7B at the completion of each 1B-token fine-tuning step. The scores in the figure are averaged across 10 directions.
...and 1 more figures
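
The Figure 2 caption says the prompt names both languages in full and that no loss is computed on the prompt. A minimal sketch of what that can look like follows. Only the first prompt line ("Translate this from German to English") comes from the caption; the rest of the prompt layout, the whitespace tokenizer, and the function names build_prompt, toy_tokenize, and build_example are assumptions made for illustration. The value -100 is the default ignore_index of PyTorch's CrossEntropyLoss, used here as a common masking convention rather than a confirmed detail of the authors' setup.

```python
from typing import Dict, List, Tuple

# Label value that PyTorch's CrossEntropyLoss skips by default (ignore_index=-100).
IGNORE_INDEX = -100


def build_prompt(src_lang: str, tgt_lang: str, source_text: str) -> str:
    """Build a translation prompt using full language names.

    The first line follows the Figure 2 caption; the remaining layout
    (language-tagged source line, trailing target tag) is an assumption.
    """
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {source_text}\n"
        f"{tgt_lang}:"
    )


def toy_tokenize(text: str) -> List[str]:
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return text.split()


def build_example(prompt: str, target: str) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with every prompt position masked out."""
    prompt_tokens = toy_tokenize(prompt)
    target_tokens = toy_tokenize(target)

    # Assign toy integer ids in order of first appearance.
    vocab: Dict[str, int] = {}
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in prompt_tokens + target_tokens]

    # Prompt positions get IGNORE_INDEX so they contribute no loss;
    # target positions keep their token ids as labels.
    n_prompt = len(prompt_tokens)
    labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    return input_ids, labels
```

With labels built this way, a standard next-token cross-entropy loss is driven only by the target-side tokens, which matches the caption's note that the prompt is excluded from the loss.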
