How Far Can 100 Samples Go? Unlocking Overall Zero-Shot Multilingual Translation via Tiny Multi-Parallel Data

Di Wu; Shaomu Tan; Yan Meng; David Stap; Christof Monz

TL;DR

For an English-centric model, fine-tuning on a very small amount of multi-parallel data yields surprisingly large zero-shot improvements, and the resulting non-English performance is close to the complete translation upper bound.

Abstract

Zero-shot translation aims to translate between language pairs not seen during training in Multilingual Machine Translation (MMT) and is largely considered an open problem. A common, albeit resource-consuming, solution is to add as many related translation directions as possible to the training corpus. In this paper, we show that for an English-centric model, surprisingly large zero-shot improvements can be achieved by simply fine-tuning with a very small amount of multi-parallel data. For example, on the EC30 dataset, we obtain an overall improvement of up to +21.7 ChrF across the 870 non-English directions by using only 100 multi-parallel samples, while preserving English-centric translation quality. When investigating the effect of fine-tuning data size and its transfer capabilities, we find that even a small, randomly sampled set of fine-tuning directions is sufficient to achieve comparable improvements. The resulting non-English performance is close to the complete translation upper bound. Even in a minimal setting, fine-tuning with only a single sample, the well-known off-target issue is almost completely resolved, which explains part, but not all, of the observed improvements in translation quality.
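As a rough illustration of the arithmetic behind the 870 directions, the following sketch expands a small multi-parallel set into directed fine-tuning pairs. It is not the authors' pipeline: the function name `build_finetune_pairs`, the `direction_fraction` parameter, and the placeholder language codes are all hypothetical, and the sketch assumes 30 non-English languages, which is consistent with 30 × 29 = 870 directed pairs.

```python
import random
from itertools import permutations


def build_finetune_pairs(multi_parallel, direction_fraction=1.0, seed=0):
    """Expand multi-parallel samples into directed (src, tgt) training pairs.

    multi_parallel: list of dicts, each mapping a language code to the same
    sentence rendered in that language (hypothetical input format).
    direction_fraction: share of directed language pairs to keep, sampled
    at random (hypothetical knob for a "subset of directions" setting).
    """
    langs = sorted(multi_parallel[0])
    # Every ordered pair of distinct languages is one translation direction.
    directions = list(permutations(langs, 2))
    rng = random.Random(seed)
    keep = max(1, round(len(directions) * direction_fraction))
    chosen = rng.sample(directions, keep)
    return [
        (src, tgt, sample[src], sample[tgt])
        for src, tgt in chosen
        for sample in multi_parallel
    ]


# Placeholder data: 30 made-up language codes, 100 aligned samples.
langs = [f"l{i:02d}" for i in range(30)]
samples = [{lang: f"sentence {j} in {lang}" for lang in langs} for j in range(100)]

all_pairs = build_finetune_pairs(samples)
num_directions = len({(src, tgt) for src, tgt, _, _ in all_pairs})

subset_pairs = build_finetune_pairs(samples, direction_fraction=0.1)
num_subset_directions = len({(src, tgt) for src, tgt, _, _ in subset_pairs})
```

Under these assumptions, 100 multi-parallel samples already expand into 100 sentence pairs for every one of the 870 directions, which is why such a small set can touch all non-English pairs at once.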

Paper Structure (33 sections, 6 figures, 19 tables)

Introduction
Related Work
Experiments
Fine-Tuning Data Construction
Datasets
NTREX-128.
Europarl-8.
EC30.
Evaluation Benchmark.
Experimental Setup
Training Setting
Fine-Tuning Setting
Large-Scale Experiments on EC30
More Data or More Directions?
How Close to the Upper Bound?
...and 18 more sections

Figures (6)

Figure 1: (a) English-centric training data is usually readily available but covers only a few real-world directions, while (b) complete translation (Freitag and Firat, 2020) aims to cover all directions but suffers from limited data scale. (c) Mining partial non-English data for bridge languages shows promising zero-shot improvements but is also resource-consuming when scaling up. (d) We show that substantial overall improvements can be achieved by fine-tuning an English-centric model with a tiny amount of extra multi-parallel data, which is readily available, for example from NTREX (Federmann et al., 2022).
Figure 2: Zero-shot performance (ChrF) on EC30 for each scaling step, grouped by high-, medium-, and low-resource languages. (a) When we randomly select {10%, 20%, 40%, 80%} of the fine-tuning directions, overall zero-shot performance stays nearly unchanged. However, (b) when we fix 10% of the directions and increase the number of fine-tuning samples from 100 to 800, consistent improvements are observed for all resource groups.
Figure 3: ChrF improvements of the upper bound and boosted models over the English-centric baseline on the Europarl-8 dataset. The overall non-English capability of the boosted model is close to the upper bound (complete translation), and the boosted model also maintains performance in English-centric directions.
Figure 4: Zero-shot performance and off-target ratio on Europarl-8 at each scaling step. The green solid line denotes the quality improvements of the translation samples that have no off-target issue.
Figure 5: Zero-shot performance (ChrF) on EC30. Boost-All denotes full fine-tuning, while Boost-Germanic denotes partial fine-tuning using Germanic languages. (a) shows the average performance evaluated within a specific language group, to which both the source and target languages belong. (b) and (c) show the average performance in out-of-Germanic and into-Germanic directions, respectively. Detailed results are provided in Table \ref{table-more-data-more-direction}.
...and 1 more figures
