Hard Non-Monotonic Attention for Character-Level Transduction

Shijie Wu, Pamela Shapiro, Ryan Cotterell

TL;DR

An exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings is introduced, showing that hard attention models can be viewed as neural reparameterizations of the classical IBM Model 1.

Abstract

Character-level string-to-string transduction is an important component of various NLP tasks. The goal is to map an input string to an output string, where the strings may be of different lengths and have characters taken from different alphabets. Recent approaches have used sequence-to-sequence models with an attention mechanism to learn which parts of the input string the model should focus on during the generation of the output string. Both soft attention and hard monotonic attention have been used, but hard non-monotonic attention has only been used in other sequence modeling tasks such as image captioning (Xu et al., 2015), and has required a stochastic approximation to compute the gradient. In this work, we introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings, showing that hard attention models can be viewed as neural reparameterizations of the classical IBM Model 1. We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the stochastic approximation and outperforms soft attention. Code is available at https://github.com/shijie-wu/neural-transducer.
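The marginalization claim can be checked numerically on a toy example. The sketch below is not the authors' implementation: it assumes the model without input feeding, where each alignment is scored independently once the output prefix is observed, and it replaces the network outputs with random tables. All names (`make_toy_model`, `brute_force_likelihood`, `exact_likelihood`) are hypothetical. It compares a brute-force sum over all |x|^|y| alignments with a product of per-step sums that touches only |x|·|y| terms.

```python
import itertools
import math
import random


def random_distribution(size, rng):
    """Return `size` positive numbers that sum to 1."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def make_toy_model(src_len, tgt_len, seed=0):
    """Build hypothetical per-step tables standing in for network outputs.

    align[i][j] plays the role of p(a_i = j | y_<i, x)
    emit[i][j]  plays the role of p(y_i | a_i = j, y_<i, x)
    """
    rng = random.Random(seed)
    align = [random_distribution(src_len, rng) for _ in range(tgt_len)]
    emit = [[rng.random() for _ in range(src_len)] for _ in range(tgt_len)]
    return align, emit


def brute_force_likelihood(align, emit):
    """Sum the joint probability over every alignment: |x| ** |y| terms."""
    tgt_len, src_len = len(align), len(align[0])
    total = 0.0
    for a in itertools.product(range(src_len), repeat=tgt_len):
        total += math.prod(align[i][a[i]] * emit[i][a[i]] for i in range(tgt_len))
    return total


def exact_likelihood(align, emit):
    """Push the sum inside the product: |x| * |y| terms."""
    return math.prod(
        sum(p_a * p_y for p_a, p_y in zip(align_i, emit_i))
        for align_i, emit_i in zip(align, emit)
    )


align, emit = make_toy_model(src_len=4, tgt_len=5)
brute = brute_force_likelihood(align, emit)   # enumerates 4 ** 5 = 1024 alignments
exact = exact_likelihood(align, emit)         # uses 4 * 5 = 20 terms
assert math.isclose(brute, exact, rel_tol=1e-9)
```

The two functions agree because the joint probability factorizes over output positions, so the sum over alignments distributes across the product. In a real model the tables would come from the encoder-decoder at each step rather than from a random generator.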

Paper Structure

This paper contains 39 sections, 12 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Example of a non-monotonic character-level transduction from the Micronesian language of Pingelapese. The infinitive mejr is mapped through a reduplicative process to its gerund mejmejr (Rehg, 1981). Each input character is drawn in green and each output character is drawn in purple, connected with a line to the corresponding input character.
  • Figure 2: Our hard-attention model without input feeding, viewed as a graphical model. Circular nodes are random variables and diamond nodes are deterministic variables ($\mathbf{h}^{\textit{(dec)}}_i$ is first discussed in the decoder section). The independence assumption between the alignments $a_i$ when the $y_i$ are observed becomes clear. Arcs from $\boldsymbol{x}$ to $y_1$, $y_2$, $y_3$, and $y_4$ are omitted for clarity (to avoid crossing arcs). The dashed edges show the additional dependencies added in the input-feeding version, as discussed in the input-feeding section. Once these are added, the $a_i$ are no longer independent, which breaks exact marginalization. Note that the hard-attention model does not enforce an exact one-to-one constraint: each source-side word is free to align with many of the target-side words, independent of context. In the latent variable model, the $x$ variable is a vector of source words, and the alignment may be over more than one element of $x$.
  • Figure 3: Attention-weight ( ; left) and alignment distribution ( ; right) of Finnish in . Both models predict correctly.
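
The independence noted in the Figure 2 caption can be written out as a short derivation. This is a sketch in generic notation, which may differ from the paper's own equations: $\boldsymbol{y}_{<i}$ denotes the observed output prefix, and the model is assumed to have no input feeding.

```latex
\begin{align*}
p(\boldsymbol{y} \mid \boldsymbol{x})
  &= \sum_{\boldsymbol{a}} \prod_{i=1}^{|\boldsymbol{y}|}
     p(y_i \mid a_i, \boldsymbol{y}_{<i}, \boldsymbol{x})\,
     p(a_i \mid \boldsymbol{y}_{<i}, \boldsymbol{x}) \\
  &= \prod_{i=1}^{|\boldsymbol{y}|} \sum_{a_i=1}^{|\boldsymbol{x}|}
     p(y_i \mid a_i, \boldsymbol{y}_{<i}, \boldsymbol{x})\,
     p(a_i \mid \boldsymbol{y}_{<i}, \boldsymbol{x})
\end{align*}
```

The first line sums over $|\boldsymbol{x}|^{|\boldsymbol{y}|}$ alignments, while the second needs only $|\boldsymbol{x}| \cdot |\boldsymbol{y}|$ terms. The exchange of sum and product is valid only because no factor at position $i$ depends on an earlier alignment $a_{<i}$. Input feeding introduces exactly that dependence, which is why the caption says it breaks exact marginalization.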