Differentiable Scheduled Sampling for Credit Assignment

Kartik Goyal, Chris Dyer, Taylor Berg-Kirkpatrick

TL;DR

This work tackles exposure bias in seq2seq training by introducing differentiable relaxations of greedy decoding, so that gradients can flow back through earlier decoding decisions. It proposes a soft-argmax for greedy decoding and a Gumbel-based reparameterization for sample-based training, and uses both as relaxed decoders within scheduled sampling. Experiments on German-English machine translation and German named entity recognition show consistent improvements over cross-entropy training and conventional scheduled sampling, which the authors attribute to better credit assignment and potentially lower gradient variance. Training cost remains comparable to standard seq2seq training.

Abstract

We demonstrate that a continuous relaxation of the argmax operation can be used to create a differentiable approximation to greedy decoding for sequence-to-sequence (seq2seq) models. By incorporating this approximation into the scheduled sampling training procedure (Bengio et al., 2015)--a well-known technique for correcting exposure bias--we introduce a new training objective that is continuous and differentiable everywhere and that can provide informative gradients near points where previous decoding decisions change their value. In addition, by using a related approximation, we demonstrate a similar approach to sample-based training. Finally, we show that our approach outperforms cross-entropy training and scheduled sampling procedures in two sequence prediction tasks: named entity recognition and machine translation.
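
The relaxation described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' code: the function names, the sharpness parameter `alpha`, and the toy scores and embeddings are assumptions made for illustration. The idea shown is that a hard argmax over token scores is replaced by a peaked-softmax-weighted average of token embeddings, which is differentiable in the scores, and that adding Gumbel noise to the scores before the same operation gives a relaxed analogue of sampling.

```python
import numpy as np

def peaked_softmax(scores, alpha):
    """Softmax over alpha * scores; larger alpha gives a sharper distribution."""
    z = alpha * (scores - scores.max())  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def soft_argmax_embedding(scores, embeddings, alpha=1.0):
    """Weighted average of embedding rows, a differentiable stand-in for
    embeddings[argmax(scores)]. `alpha` is an illustrative sharpness knob."""
    return peaked_softmax(scores, alpha) @ embeddings

def gumbel_soft_argmax_embedding(scores, embeddings, alpha=1.0, rng=None):
    """Relaxed sampling: perturb scores with Gumbel(0, 1) noise, then soft-argmax.
    The hard argmax of the perturbed scores is a sample from softmax(scores)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.gumbel(size=scores.shape)
    return soft_argmax_embedding(scores + noise, embeddings, alpha)

# Toy example (assumed values): 3 tokens with one-hot "embeddings".
scores = np.array([1.0, 3.0, 2.0])
embeddings = np.eye(3)

soft = soft_argmax_embedding(scores, embeddings, alpha=1.0)    # smooth mixture
sharp = soft_argmax_embedding(scores, embeddings, alpha=50.0)  # close to row 1
sample = gumbel_soft_argmax_embedding(
    scores, embeddings, alpha=1.0, rng=np.random.default_rng(0)
)
```

In a scheduled-sampling setup, a vector such as `soft` or `sample` would be fed to the decoder at the next time step in place of the embedding of a hard model prediction, so the loss at later steps stays differentiable with respect to earlier scores.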

Paper Structure

This paper contains 10 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Discontinuous scheduled sampling objective (red) and continuous relaxations (blue and purple).
  • Figure 2: Relaxed greedy decoder that uses a continuous approximation of argmax as input to the decoder state at next time step.