Translation Artifacts in Cross-lingual Transfer Learning
Mikel Artetxe, Gorka Labaka, Eneko Agirre
TL;DR
The paper investigates how translation-based dataset creation and the resulting translation artifacts affect cross-lingual transfer learning. The authors train models on Orig, BT, and MT variants of the training data and evaluate them on original and translated test sets for NLI and QA. They find that translation artifacts, especially those introduced by translating premises and hypotheses independently, can substantially distort measured performance, so that Translate-Test and Zero-Shot results do not reflect true cross-lingual generalization. Training on back-translated (paraphrased) or machine-translated data closes much of the Translate-Test gap and improves the state of the art on XNLI using a single MT system, while the effects on QA are benchmark-dependent. The authors advocate rethinking evaluation protocols and benchmark design to account for translation artifacts, and they provide a Spanish-language NLI dataset to support future cross-lingual analysis.
Abstract
Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such translation processes can introduce subtle artifacts that have a notable impact on existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, to which current models are highly sensitive. We show that some previous findings in cross-lingual transfer learning need to be reconsidered in light of this phenomenon. Based on the insights gained, we also improve the state of the art on XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
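
To make the lexical-overlap effect mentioned in the abstract concrete, the following is a minimal sketch of one possible overlap measure: the fraction of hypothesis word types that also appear in the premise. The measure, the function names, and the sentence pair are assumptions made for illustration and are not taken from the paper. The second hypothesis only imitates what independent translation might do by choosing different words for the same content.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word types (crude tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def lexical_overlap(premise, hypothesis):
    """Fraction of hypothesis word types that also occur in the premise.

    Illustrative measure only; not necessarily the one used in the paper.
    """
    hyp = tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & tokens(premise)) / len(hyp)


# Invented example pair (hypothetical, not from any dataset).
premise = "The man is buying a car"
# Hypothesis worded consistently with the premise.
hypothesis_consistent = "The man is buying a vehicle"
# Same meaning, but with word choices that diverge from the premise,
# as might happen when premise and hypothesis are translated separately.
hypothesis_divergent = "The guy purchases a vehicle"

overlap_consistent = lexical_overlap(premise, hypothesis_consistent)
overlap_divergent = lexical_overlap(premise, hypothesis_divergent)
print(f"consistent wording: {overlap_consistent:.2f}")
print(f"divergent wording:  {overlap_divergent:.2f}")
```

The meaning of the hypothesis is unchanged in both cases, but the overlap score drops for the divergent wording. A model that relies on overlap as a cue for entailment could therefore behave differently on the two versions.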
