Faster Transformer Decoding: N-gram Masked Self-Attention

Ciprian Chelba, Mia Chen, Ankur Bapna, Noam Shazeer

TL;DR

The paper targets the computational bottleneck of full decoder self-attention in Transformers by introducing an N-gram masked self-attention mechanism, motivated by the observation that most of the information needed to predict each target token comes from the source sentence. Limiting the decoder's self-attention window to the previous N−1 target tokens reduces the self-attention cost from O(T²) to O(N·T) for a target of length T and allows a fixed-size memory buffer during decoding. With N around 8–10, the model reaches near-baseline BLEU on WMT EnDe/EnFr, with potential 2–3× speedups at the cost of only modest BLEU losses. The work motivates hardware-aware optimizations and further exploration of decoder-local attention in MT and related sequence modeling tasks.

Abstract

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
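
The truncated target-side window described above can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function names `ngram_causal_mask` and `attended_pairs` are hypothetical, and the sketch assumes the usual shifted-right decoder input, so that the query at position i attends to itself and the N−2 positions before it (N−1 target tokens in total).

```python
def ngram_causal_mask(num_positions, n):
    """Return mask[i][j] == True where decoder position i may attend to position j.

    Assumption (illustrative, not taken from the paper): decoder inputs are
    shifted right, so position i sees itself plus the n - 2 positions before
    it, i.e. a window of n - 1 target tokens.
    """
    if n < 2:
        raise ValueError("n must be at least 2 so the window holds one token")
    window = n - 1
    return [
        [i - window < j <= i for j in range(num_positions)]
        for i in range(num_positions)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    T, N = 6, 3
    windowed = ngram_causal_mask(T, N)
    # A window at least as long as the sequence recovers the full causal mask.
    full = ngram_causal_mask(T, T + 1)
    for row in windowed:
        print("".join("x" if allowed else "." for allowed in row))
    print("windowed pairs:", attended_pairs(windowed))
    print("full causal pairs:", attended_pairs(full))
```

The number of allowed pairs grows roughly as (N−1)·T for the windowed mask, compared with T(T+1)/2 for the full causal mask, which is where the linear-in-T cost comes from. In a real model the mask would typically be applied by setting disallowed attention logits to a large negative value before the softmax.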

Paper Structure

This paper contains 5 sections, 2 tables.