Reward Augmented Maximum Likelihood for Neural Structured Prediction

Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans

TL;DR

RAML presents a hybrid training objective that directly incorporates task rewards into structured output prediction by augmenting training targets with outputs sampled from an exponentiated payoff distribution. By minimizing a KL divergence $D_{KL}(q(y\mid y^*; \tau) \Vert p_\theta(y\mid x))$, RAML achieves a balance between supervised learning and reward optimization, with gradients computed by sampling from $q$. The approach is shown to be mathematically linked to regularized expected reward and distinct from RL due to the opposite KL directions and variance considerations. Empirically, RAML yields consistent improvements over strong ML baselines in both speech recognition (TIMIT) and machine translation (WMT'14 En–Fr), demonstrating its practicality and potential for broader application to non-differentiable evaluation metrics.
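As a sketch of the objective described above: write $r(y, y^*)$ for the task reward and $\tau$ for the temperature. The normalizer $Z(y^*, \tau)$ and the entropy symbol $\mathbb{H}$ are notation assumed here for illustration. The exponentiated payoff distribution and the KL objective can then be written as

$$
q(y \mid y^*; \tau) = \frac{1}{Z(y^*, \tau)} \exp\{ r(y, y^*) / \tau \}, \qquad Z(y^*, \tau) = \sum_{y'} \exp\{ r(y', y^*) / \tau \}
$$

$$
D_{KL}(q(y\mid y^*; \tau) \Vert p_\theta(y\mid x)) = -\mathbb{H}(q(y\mid y^*; \tau)) - \sum_{y} q(y\mid y^*; \tau) \log p_\theta(y\mid x)
$$

Since the entropy term does not depend on $\theta$, the gradient is an expectation under $q$, which is why it can be estimated by sampling from $q$ rather than from the model:

$$
\nabla_\theta D_{KL}(q \Vert p_\theta) = -\mathbb{E}_{y \sim q(y\mid y^*; \tau)} \left[ \nabla_\theta \log p_\theta(y\mid x) \right]
$$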

Abstract

A key problem in structured output prediction is direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient approach to incorporate task reward into a maximum likelihood framework. By establishing a link between the log-likelihood and expected reward objectives, we show that an optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated scaled rewards. Accordingly, we present a framework to smooth the predictive probability of the outputs using their corresponding rewards. We optimize the conditional log-probability of augmented outputs that are sampled proportionally to their exponentiated scaled rewards. Experiments on neural sequence to sequence models for speech recognition and machine translation show notable improvements over a maximum likelihood baseline by using reward augmented maximum likelihood (RAML), where the rewards are defined as the negative edit distance between the outputs and the ground truth labels.
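The sampling step described in the abstract can be illustrated with a minimal Python sketch. It assumes a small, explicitly enumerated candidate list, which is only feasible in a toy setting; the function names (`edit_distance`, `payoff_distribution`, `sample_augmented_target`) and the candidate strings are hypothetical and not taken from the paper. The reward is the negative edit distance to the ground truth, and candidates are drawn with probability proportional to the exponentiated scaled reward.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def payoff_distribution(candidates, target, tau):
    """Probabilities proportional to exp(-edit_distance(y, target) / tau).

    Assumption: normalization is over the given candidate list only,
    not over the full output space.
    """
    scores = [-edit_distance(y, target) / tau for y in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def sample_augmented_target(candidates, target, tau, rng=random):
    """Draw one augmented training target from the payoff distribution."""
    probs = payoff_distribution(candidates, target, tau)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Toy usage: the ground truth is the most likely sample, and
# candidates further away (in edit distance) are sampled less often.
target = "abc"
candidates = ["abc", "abd", "ab", "xyz"]
probs = payoff_distribution(candidates, target, tau=0.9)
augmented = sample_augmented_target(candidates, target, tau=0.9)
```

A training loop would then maximize the model's log-probability of `augmented` in place of `target`. As `tau` approaches zero the distribution concentrates on the ground truth, and the procedure reduces to ordinary maximum likelihood.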

Paper Structure

This paper contains 12 sections, 3 theorems, 26 equations, 1 figure, 2 tables.

Key Result

Proposition 1

For any twice differentiable strictly convex closed potential $F$, and $p, q\in\mathrm{int}(\mathcal{F})$: for some $a=(1-\alpha)p+\alpha q$ ($0\leq\alpha\leq\frac{1}{2}$), $b=(1-\beta)q+\beta p$ ($0\leq\beta\leq\frac{1}{2}$). (See the appendix for the proof.)

Figures (1)

  • Figure 1: Fraction of augmentations by number of edits applied to a sequence of length $20$, for different values of $\tau$. At $\tau = 0.9$, augmentations with $5$ to $9$ edits are sampled with a probability $> 0.1$.

Theorems & Definitions (5)

  • Proposition 2
  • Proposition 1
  • proof
  • Proposition 2
  • proof