Scheduled Sampling for Transformers

Tsvetomila Mihaylova, André F. T. Martins

TL;DR

This work tackles exposure bias in Transformer-based sequence-to-sequence models by introducing a two-pass decoding scheme that enables scheduled sampling for Transformers. The approach trains with a mixed reference in a second decoder pass while sharing parameters with the first pass, and explores several embedding-mix strategies and backpropagation schemes. Experiments on two language pairs show that certain non-differentiable embedding-mix variants can modestly improve validation BLEU over a teacher-forcing baseline, while differentiable backpropagation through both decoders can hurt performance. Overall, the study demonstrates a viable path to mitigating exposure bias in Transformer models and highlights scheduling choices as a key factor for success.

Abstract

Scheduled sampling is a technique for avoiding one of the known problems in sequence-to-sequence generation: exposure bias. At training time, it feeds the model a mix of the teacher-forced embeddings and the model predictions from the previous step. The technique has been used to improve model performance with recurrent neural networks (RNNs). In the Transformer model, unlike the RNN, the generation of a new word attends to the full sentence generated so far, not only to the last word, so applying scheduled sampling is not straightforward. We propose some structural changes that allow scheduled sampling to be applied to the Transformer architecture via a two-pass decoding strategy. Experiments on two language pairs achieve performance close to a teacher-forcing baseline and show that this technique is promising for further exploration.
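The mixing step described in the abstract can be illustrated with a minimal token-level sketch. It is not the authors' implementation: the function name `mix_sequences`, the parameter `teacher_forcing_prob`, and the choice to mix discrete tokens rather than embeddings are assumptions made here for illustration only.

```python
import random
from typing import List, Optional, Sequence


def mix_sequences(
    gold: Sequence[str],
    predicted: Sequence[str],
    teacher_forcing_prob: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Build a mixed sequence, position by position.

    Each position keeps the gold token with probability
    `teacher_forcing_prob` and otherwise takes the model's prediction.
    (Hypothetical helper; the paper mixes embeddings, not raw tokens.)
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not 0.0 <= teacher_forcing_prob <= 1.0:
        raise ValueError("teacher_forcing_prob must be in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so prob 1.0 always picks gold
    # and prob 0.0 always picks the prediction.
    return [
        g if rng.random() < teacher_forcing_prob else p
        for g, p in zip(gold, predicted)
    ]


if __name__ == "__main__":
    gold = ["the", "cat", "sat"]
    pred = ["a", "cat", "sits"]
    print(mix_sequences(gold, pred, 0.5, random.Random(0)))
```

With `teacher_forcing_prob` set to 1.0 this reduces to plain teacher forcing. Lowering the value over the course of training exposes the model to more of its own predictions.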

Paper Structure

This paper contains 8 sections, 1 equation, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Transformer model adapted for use with scheduled sampling. The two decoders in the image share the same parameters. The first decoder pass conditions on the gold target sequence and returns the model predictions. The second pass conditions on a mix of the target sequence and the model predictions and returns the result. The thicker lines show the path that is backpropagated in all experiments, i.e., we always backpropagate through the second decoder pass. The thin arrows are backpropagated in only some of the experiments. (The image is based on the Transformer architecture of Vaswani et al., 2017.)
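
The data flow in Figure 1 can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `two_pass_decode`, `toy_decoder`, and the `BOS` symbol are hypothetical names, the decoder is a plain Python callable standing in for a Transformer decoder, and the shift-right alignment of inputs is assumed from standard teacher forcing rather than taken from the source.

```python
import random
from typing import Callable, List, Sequence, Tuple

BOS = "<s>"  # hypothetical start-of-sequence symbol

# A decoder maps its input tokens to one prediction per position.
Decoder = Callable[[Sequence[str]], List[str]]


def two_pass_decode(
    gold: Sequence[str],
    decoder: Decoder,
    teacher_forcing_prob: float,
    rng: random.Random,
) -> Tuple[List[str], List[str], List[str]]:
    """Run the two decoder passes and return
    (first-pass predictions, second-pass inputs, second-pass predictions)."""
    # First pass: condition on the gold target, shifted right.
    first_inputs = [BOS] + list(gold[:-1])
    first_preds = decoder(first_inputs)

    # Mix gold tokens with first-pass predictions. The prediction made at
    # position t stands in for the gold token fed at position t + 1.
    mixed_prefix = [
        g if rng.random() < teacher_forcing_prob else p
        for g, p in zip(gold[:-1], first_preds[:-1])
    ]

    # Second pass: the same callable is reused, mirroring shared parameters.
    second_inputs = [BOS] + mixed_prefix
    second_preds = decoder(second_inputs)
    return first_preds, second_inputs, second_preds


def toy_decoder(inputs: Sequence[str]) -> List[str]:
    """Stand-in for a real decoder: upper-cases each input token."""
    return [tok.upper() for tok in inputs]


if __name__ == "__main__":
    result = two_pass_decode(["a", "b", "c"], toy_decoder, 0.5, random.Random(0))
    print(result)
```

In a real training loop, the loss would be computed from the second-pass output against the gold target. Whether gradients also flow back through the first pass corresponds to the thin arrows in the figure, which the caption says are backpropagated in only some of the experiments.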