Emergent Communication Pretraining for Few-Shot Machine Translation

Yaoyiran Li, Edoardo M. Ponti, Ivan Vulić, Anna Korhonen

TL;DR

The paper addresses the challenge of extreme data scarcity for language tasks by pretraining encoders/decoders through emergent communication in image-grounded referential games, thereby inducing an inductive bias toward natural language without any human language data. After EC pretraining, the learned components are repurposed for few-shot neural machine translation and augmented with adapters and annealed regularisation to enable effective knowledge transfer and mitigate forgetting. Empirical results show large BLEU gains across multiple language pairs and data regimes, with strong support for the role of EC-derived representations and the synergy between adapters and regularisation. The work also uses translation performance as an extrinsic evaluation of emergent languages, linking communication success to downstream gains and offering a new protocol for probing artificial languages.

Abstract

While state-of-the-art models that rely upon massively multilingual pretrained encoders achieve sample efficiency in downstream applications, they still require abundant amounts of unlabelled text. Nevertheless, most of the world's languages lack such resources. Hence, we investigate a more radical form of unsupervised knowledge transfer in the absence of linguistic data. In particular, for the first time we pretrain neural networks via emergent communication from referential games. Our key assumption is that grounding communication on images, as a crude approximation of real-world environments, inductively biases the model towards learning natural languages. On the one hand, we show that this substantially benefits machine translation in few-shot settings. On the other hand, this also provides an extrinsic evaluation protocol to probe the properties of emergent languages ex vitro. Intuitively, the closer they are to natural languages, the higher the gains from pretraining on them should be. For instance, in this work we measure the influence of communication success and maximum sequence length on downstream performance. Finally, we introduce a customised adapter layer and annealing strategies for the regulariser of maximum-a-posteriori inference during fine-tuning. These turn out to be crucial to facilitate knowledge transfer and prevent catastrophic forgetting. Compared to a recurrent baseline, our method yields gains of $59.0\%\sim147.6\%$ in BLEU score with only $500$ NMT training instances and $65.1\%\sim196.7\%$ with $1{,}000$ NMT training instances across four language pairs. These proof-of-concept results reveal the potential of emergent communication pretraining for both natural language processing tasks in resource-poor settings and extrinsic evaluation of artificial languages.
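
The annealed regulariser mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian prior centred on the pretrained parameters, which turns maximum-a-posteriori fine-tuning into an L2 penalty towards those parameters, and an exponential decay schedule for the penalty weight. The function names, the schedule, and the default values are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def annealed_weight(step, lam0=1.0, decay=0.01):
    """Regulariser weight at a given step (hypothetical exponential decay)."""
    return lam0 * math.exp(-decay * step)


def map_penalty(params, pretrained, lam):
    """L2 penalty from a Gaussian prior centred on the pretrained values."""
    return 0.5 * lam * sum((p - q) ** 2 for p, q in zip(params, pretrained))


def regularised_loss(task_loss, params, pretrained, step, lam0=1.0, decay=0.01):
    """Task loss plus the annealed penalty towards the pretrained parameters."""
    lam = annealed_weight(step, lam0, decay)
    return task_loss + map_penalty(params, pretrained, lam)
```

Under this assumed schedule the penalty is strongest early in fine-tuning, when drifting away from the pretrained parameters is most likely to erase what was learned, and it relaxes as training proceeds so the model can adapt to the translation data.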

Paper Structure

This paper contains 12 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An overview of the model architecture. Dashed lines denote parameter transfer from the EC pretraining task to the MT fine-tuning task. We stress that during EC pretraining, we do not leverage any image-caption pairs; instead, only unlabelled images are used. During MT fine-tuning, standard seq2seq NMT models are trained on SRC and TRG sentence pairs without any visual information available.
  • Figure 2: Impact of EC prediction accuracy on NMT BLEU scores for en-de (left) and ro-en (right). All BLEU scores are obtained in the '1k Samples' setup with the full model variant EC Transferred + Adapter + REG-A.
  • Figure 3: Impact of maximum EC message length ($L_{max}$) on NMT performance. All BLEU scores are obtained in the '1k Samples' setup with the full model variant EC Transferred + Adapter + REG-A.
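
The EC prediction accuracy referred to in Figure 2 can be sketched as follows, assuming a referential game in which a listener receives a message representation and must pick the target image among several candidates. Dot-product scoring, plain-list vectors, and all function names here are hypothetical simplifications rather than the paper's implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def listener_choice(message_vec, candidate_vecs):
    """Index of the candidate image that scores highest against the message."""
    scores = [dot(message_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def communication_accuracy(rounds):
    """Fraction of rounds in which the listener picks the target.

    Each round is a tuple (message_vec, candidate_vecs, target_index).
    """
    correct = sum(listener_choice(m, cands) == t for m, cands, t in rounds)
    return correct / len(rounds)
```

For example, with toy two-dimensional vectors, `listener_choice([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])` selects index 1, and averaging such outcomes over many rounds gives the accuracy figure that the paper relates to downstream BLEU.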