Exact Hard Monotonic Attention for Character-Level Transduction

Shijie Wu, Ryan Cotterell

TL;DR

This work investigates whether monotonicity is a beneficial inductive bias for character-level transduction by introducing an exact hard-monotonic attention framework. The authors extend prior hard attention with monotone alignment constraints and neural parameterization, enabling cubic-time marginalization over alignments and greedy decoding for inference. They demonstrate state-of-the-art single-model performance on morphological inflection and strong results on grapheme-to-phoneme conversion and named-entity transliteration, arguing that jointly learned monotone alignments are advantageous. The approach highlights the interpretability of alignment distributions and the practical viability of monotone transducers, albeit with some computational overhead relative to non-monotonic baselines. Code is released to facilitate reuse and further research.

Abstract

Many common character-level, string-to-string transduction tasks, e.g., grapheme-to-phoneme conversion and morphological inflection, consist almost exclusively of monotonic transductions. However, neural sequence-to-sequence models that use non-monotonic soft attention often outperform popular monotonic models. In this work, we ask the following question: Is monotonicity really a helpful inductive bias for these tasks? We develop a hard attention sequence-to-sequence model that enforces strict monotonicity and learns a latent alignment jointly while learning to transduce. With the help of dynamic programming, we are able to compute the exact marginalization over all monotonic alignments. Our models achieve state-of-the-art performance on morphological inflection. Furthermore, we find strong performance on two other character-level transduction tasks. Code is available at https://github.com/shijie-wu/neural-transducer.
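To make the dynamic-programming claim concrete, here is a minimal sketch of a forward-style recursion that sums over all monotonic alignments. It is not the authors' implementation: the function names, the fixed probability tables `init`, `trans`, and `emit`, and the toy numbers are all assumptions made for illustration. In the actual model these quantities would be produced by neural networks and could change at every target step, whereas this sketch shares one transition table across steps.

```python
from itertools import product


def monotonic_marginal(emit, trans, init):
    """Sum the joint probability over all monotonic alignments.

    Hypothetical toy tables (not the paper's parameterization):
      emit[i][j]  : probability of target symbol i given source position j
      trans[j][k] : probability of moving from source position j to k,
                    assumed to be zero whenever k < j (monotonicity)
      init[j]     : probability that the first target symbol aligns to j
    """
    n_src = len(init)
    # alpha[j] = probability of the target prefix with the last symbol aligned to j
    alpha = [init[j] * emit[0][j] for j in range(n_src)]
    for i in range(1, len(emit)):
        # Only predecessors j <= k are summed, so alignments never move backwards.
        alpha = [
            emit[i][k] * sum(alpha[j] * trans[j][k] for j in range(k + 1))
            for k in range(n_src)
        ]
    return sum(alpha)


def brute_force_marginal(emit, trans, init):
    """Reference computation: enumerate every non-decreasing alignment."""
    n_src, n_tgt = len(init), len(emit)
    total = 0.0
    for a in product(range(n_src), repeat=n_tgt):
        if any(a[i] < a[i - 1] for i in range(1, n_tgt)):
            continue  # skip non-monotonic alignments
        p = init[a[0]] * emit[0][a[0]]
        for i in range(1, n_tgt):
            p *= trans[a[i - 1]][a[i]] * emit[i][a[i]]
        total += p
    return total


# Toy example: 3 source positions, 2 target symbols (made-up numbers).
init = [0.6, 0.3, 0.1]
trans = [
    [0.5, 0.3, 0.2],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
emit = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]

dp_value = monotonic_marginal(emit, trans, init)
reference = brute_force_marginal(emit, trans, init)
```

In this toy setting the recursion takes a number of steps proportional to the target length times the square of the source length, while the brute-force enumeration grows exponentially with the target length. The two values should agree up to floating-point error, which is the sense in which the marginalization is exact.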

Paper Structure

This paper contains 25 sections, 11 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Example of source and target string for each task. Tag guides transduction in morphological inflection.
  • Figure 2: Our monotonic hard-attention model viewed as a graphical model. The circular nodes are random variables and the diamond nodes are deterministic variables. We have omitted arcs from $\boldsymbol{x}$ to $y_1$, $y_2$, $y_3$ and $y_4$ for clarity (to avoid crossing arcs).