SEARNN: Training RNNs with Global-Local Losses

Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, Simon Lacoste-Julien

TL;DR

This paper identifies a misalignment between conventional maximum-likelihood training and the structured test-time error in RNN-based sequence prediction, including the training/prediction discrepancy known as exposure bias. It introduces SeaRnn, a learning-to-search–inspired training framework that builds global-local losses: at each time step, roll-in and roll-out predictions produce a sequence-informed cost vector, which is then fed to a cost-sensitive loss, with the log-loss (LL) and Kullback-Leibler (KL) variants as the principal choices. SeaRnn improves over MLE on OCR and spelling correction, and scales to neural machine translation by subsampling cells and tokens, which maintains performance while reducing computation. The work positions SeaRnn as a principled alternative to RL-based methods, offering stronger use of structured loss signals, reduced reliance on warm starts, and practical scalability to large-vocabulary tasks.

Abstract

We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the "learning to search" (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error. We first demonstrate improved performance over MLE on two different tasks: OCR and spelling correction. Then, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes. This allows us to validate the benefits of our approach on a machine translation task.

Paper Structure

This paper contains 46 sections, 8 equations, 1 figure, 3 tables, 2 algorithms.

Figures (1)

  • Figure 1: Illustration of the roll-in/roll-out mechanism used in SeaRnn. The goal is to obtain a vector of costs for each cell of the RNN in order to define a cost-sensitive loss to train the network. These vectors have one entry per possible token. Here, we show how to obtain the vector of costs for the red cell. First, we use a roll-in policy to predict until the cell of interest. We highlight here the learned policy where the network passes its own prediction to the next cell. Second, we proceed to the roll-out phase. We feed every possible token (illustrated by the red letters) to the next cell and let the model predict the full sequence. For each token $a$, we obtain a predicted sequence $\hat{y}_a$. Comparing it to the ground truth sequence $y$ yields the associated cost $c(a)$.
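
The roll-in/roll-out procedure in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `toy_policy` is a hypothetical stand-in for the RNN, the cost is assumed to be the Hamming distance on fixed-length sequences, and the same learned policy is assumed for both roll-in and roll-out. The names `cost_vector`, `hamming_cost`, and `toy_policy` are invented for this example.

```python
from typing import Callable, Dict, List, Sequence

# A policy maps the tokens emitted so far to the next token.
# In SeaRnn this role is played by the RNN; here it is a hypothetical stand-in.
Policy = Callable[[Sequence[str]], str]


def hamming_cost(predicted: Sequence[str], truth: Sequence[str]) -> int:
    """Number of positions where two equal-length sequences differ (assumed cost)."""
    return sum(p != t for p, t in zip(predicted, truth))


def cost_vector(
    policy: Policy, truth: Sequence[str], vocab: Sequence[str], t: int
) -> Dict[str, int]:
    """Return the cost c(a) of every token a at cell t."""
    # Roll-in: the policy feeds its own predictions forward up to cell t.
    prefix: List[str] = []
    for _ in range(t):
        prefix.append(policy(prefix))

    costs: Dict[str, int] = {}
    for a in vocab:
        # Roll-out: force token a at cell t, then let the policy finish the sequence.
        sequence = prefix + [a]
        while len(sequence) < len(truth):
            sequence.append(policy(sequence))
        # Compare the completed sequence with the ground truth to get c(a).
        costs[a] = hamming_cost(sequence, truth)
    return costs


def toy_policy(prefix: Sequence[str]) -> str:
    """Hypothetical policy: start with 'c', then repeat the previous token."""
    return prefix[-1] if prefix else "c"


costs = cost_vector(toy_policy, truth=list("cat"), vocab=["a", "c", "t"], t=1)
print(costs)
```

Each entry of `costs` plays the role of $c(a)$ for the cell of interest; a cost-sensitive loss would then compare the network's scores at that cell against this vector. Note that one cost vector requires a full roll-out per token, which is the expense that the subsampling of cells and tokens described in the abstract is meant to reduce.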