Tagged Back-Translation
Isaac Caswell, Ciprian Chelba, David Grangier
TL;DR
Tagged Back-Translation (TaggedBT) marks synthetic back-translated (BT) source sentences with a distinct input tag to signal their origin, so the model can treat BT data as a separate domain. The approach often matches or surpasses noised back-translation (NoisedBT) across language pairs, with strong gains on English-Romanian (EnRo) and competitive results on English-German (EnDe), and it enables iterative back-translation in some setups. Analyses show that the model attends strongly to the tag and that the tag shifts decoding toward BT-domain translations, supporting the idea that a simple domain signal lets the model separate the beneficial signal in synthetic data from the biased one. Overall, tagging is a simpler, robust alternative to noising back-translated data, with practical benefits for NMT systems.
Abstract
Recent work in Neural Machine Translation (NMT) has shown significant quality gains from noised-beam decoding during back-translation, a method to generate synthetic parallel data. We show that the main role of such synthetic noise is not to diversify the source side, as previously suggested, but simply to indicate to the model that the given source is synthetic. We propose a simpler alternative to noising techniques, consisting of tagging back-translated source sentences with an extra token. Our results on WMT outperform noised back-translation in English-Romanian and match performance on English-German, re-defining state-of-the-art in the former.
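The tagging step described in the abstract can be illustrated with a minimal sketch. It assumes the data is held as (source, target) string pairs, that the tag is the literal token "<BT>" reserved in the model vocabulary, and that the tag is prepended to the source; the token string, its position, and the function names are illustrative assumptions, not details taken from the text above.

```python
from typing import Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)

# Hypothetical tag token; it should be reserved so the tokenizer
# never splits it and it never occurs in genuine text.
BT_TAG = "<BT>"


def tag_back_translated(
    pairs: Iterable[SentencePair], tag: str = BT_TAG
) -> List[SentencePair]:
    """Mark each synthetic source sentence with the tag; targets are untouched."""
    return [(f"{tag} {src}", tgt) for src, tgt in pairs]


def build_training_set(
    genuine: Iterable[SentencePair],
    back_translated: Iterable[SentencePair],
    tag: str = BT_TAG,
) -> List[SentencePair]:
    """Combine genuine bitext (untagged) with tagged back-translated data."""
    return list(genuine) + tag_back_translated(back_translated, tag)


if __name__ == "__main__":
    genuine = [("Hello world", "Salut lume")]
    synthetic = [("Good morning", "Bună dimineața")]  # source produced by a reverse model
    for src, tgt in build_training_set(genuine, synthetic):
        print(f"{src}\t{tgt}")
```

Only the synthetic source side changes, so genuine test-time input is simply left untagged.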
