Monotonic Chunkwise Attention

Chung-Cheng Chiu, Colin Raffel

TL;DR

Problem: Standard soft attention incurs quadratic time/space and is unsuitable for online real-time transduction. Approach: MoChA combines a hard monotonic endpoint with soft attention over a small memory chunk preceding that endpoint, enabling online decoding with soft alignments and trainable via backpropagation. Findings: Achieves state-of-the-art performance on online speech recognition (WSJ) and substantially narrows the gap between monotonic and soft attention on document summarization (CNN/Daily Mail). Significance: Provides online, linear-time decoding while allowing local reorderings, with only modest additional parameters and computation.

Abstract

Sequence-to-sequence models with soft attention have been successfully applied to a wide variety of problems, but their decoding process incurs a quadratic time and space cost and is inapplicable to real-time sequence transduction. To address these issues, we propose Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed. We show that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time. When applied to online speech recognition, we obtain state-of-the-art results and match the performance of a model using an offline soft attention mechanism. In document summarization experiments where we do not expect monotonic alignments, we show significantly improved performance compared to a baseline monotonic attention-based model.

Paper Structure

This paper contains 16 sections, 22 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Schematics of the attention mechanisms discussed in this paper. Each node represents the possibility of the model attending to a given memory entry (horizontal axis) at a given output timestep (vertical axis). (a) In soft attention, the model assigns a probability (represented by the shade of gray of each node) to each memory entry at each output timestep. The context vector is computed as the weighted average of the memory, weighted by these probabilities. (b) At test time, monotonic attention inspects memory entries from left-to-right, choosing whether to move on to the next memory entry (shown as nodes with $\times$) or stop and attend (shown as black nodes). The context vector is hard-assigned to the memory entry that was attended to. At the next output timestep, it starts again from where it left off. (c) MoChA utilizes a hard monotonic attention mechanism to choose the endpoint (shown as nodes with bold borders) of the chunk over which it attends. The chunk boundaries (here, with a window size of $3$) are shown as dotted lines. The model then performs soft attention (with attention weighting shown as the shade of gray) over the chunk, and computes the context vector as the chunk's weighted average.
  • Figure 2: Attention alignment plots and speech utterance feature sequence for the speech recognition task.
  • Figure 3: Speeds of different attention mechanisms on a synthetic benchmark.
  • Figure 4: Schematic of the test-time decoding procedure of MAtChA. The semantics of the nodes and horizontal and vertical axes are as in Figure 1. MAtChA performs soft attention over variable-sized chunks set by the locations attended to by a monotonic attention mechanism.
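
The test-time procedure described in the Figure 1 caption, panels (b) and (c), can be illustrated with a minimal NumPy sketch of a single output timestep. This is an illustration under stated assumptions, not the authors' implementation: the function name `mocha_decode_step` and its arguments are hypothetical, the two energy vectors are taken as precomputed inputs (in a real model they would be produced by learned scoring functions), the stop rule is assumed to be a sigmoid thresholded at 0.5, and a zero context vector is assumed when no endpoint is chosen.

```python
import numpy as np


def mocha_decode_step(memory, select_energy, chunk_energy, start, chunk_size):
    """One output timestep of chunkwise decoding (illustrative sketch).

    memory:        (T, d) array of memory entries.
    select_energy: (T,) scores for the hard stop/continue decision (assumed input).
    chunk_energy:  (T,) scores for soft attention inside the chunk (assumed input).
    start:         memory index where the previous output timestep stopped.
    chunk_size:    window size w of the chunk ending at the chosen endpoint.

    Returns (context, endpoint); endpoint is None if no entry was chosen.
    """
    num_entries, dim = memory.shape
    # Scan left to right from where the previous timestep left off.
    for j in range(start, num_entries):
        # Assumed stop rule: sigmoid of the selection energy, thresholded at 0.5.
        p_stop = 1.0 / (1.0 + np.exp(-select_energy[j]))
        if p_stop >= 0.5:
            # The chunk is the (up to) chunk_size entries ending at endpoint j.
            lo = max(0, j - chunk_size + 1)
            scores = chunk_energy[lo:j + 1]
            # Numerically stable softmax over the chunk only.
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            # Context vector is the chunk's weighted average.
            return weights @ memory[lo:j + 1], j
    # Assumption: return a zero context if the scan reaches the end of memory.
    return np.zeros(dim), None


# Usage: 4 one-hot memory entries, window size 3, endpoint chosen at index 2.
memory = np.eye(4)
select_energy = np.array([-1.0, -1.0, 2.0, -1.0])
chunk_energy = np.zeros(4)
context, endpoint = mocha_decode_step(memory, select_energy, chunk_energy, 0, 3)
```

Because each output timestep resumes the scan at the previous endpoint and the softmax covers at most `chunk_size` entries, the work per output step stays bounded by the window size plus the number of entries skipped, which is what allows online decoding.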