Hard-Coded Gaussian Attention for Neural Machine Translation
Weiqiu You, Simeng Sun, Mohit Iyyer
TL;DR
The paper investigates whether the Transformer's learned multi-headed attention is necessary for high-quality neural machine translation. It introduces a parameter-free, hard-coded Gaussian attention variant and shows that encoder and decoder self-attention can be replaced with fixed, input-agnostic local distributions with minimal BLEU loss across four language pairs, whereas also hard-coding cross-attention lowers BLEU significantly. A single learned cross-attention head in the final decoder layer recovers much of that drop. The results point toward simpler, easier-to-implement attention models with near-baseline accuracy, with potential practical benefits in memory use and decoding speed.
Abstract
Recent work has questioned the importance of the Transformer's multi-headed attention for achieving high translation quality. We push further in this direction by developing a "hard-coded" attention variant without any learned parameters. Surprisingly, replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs. However, additionally hard-coding cross attention (which connects the decoder to the encoder) significantly lowers BLEU, suggesting that it is more important than self-attention. Much of this BLEU drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer. Taken as a whole, our results offer insight into which components of the Transformer are actually important, which we hope will guide future work into the development of simpler and more efficient attention-based models.
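
To make the idea of a "fixed, input-agnostic Gaussian distribution" concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names (`gaussian_attention`, `hard_coded_self_attention`), the `offset` and `sigma` parameters and their default values, the causal masking, and the row renormalization are all illustrative assumptions, since the abstract does not specify them. The point it shows is that the attention weights depend only on token positions, never on token content, and involve no learned parameters.

```python
import numpy as np

def gaussian_attention(seq_len, offset=0, sigma=1.0, causal=False):
    """Return a (seq_len, seq_len) matrix of fixed attention weights.

    Row i is a Gaussian over key positions j, centered at i + offset.
    `offset` and `sigma` are hypothetical knobs chosen for illustration.
    """
    positions = np.arange(seq_len)
    centers = positions[:, None] + offset      # shape (seq_len, 1)
    dist = positions[None, :] - centers        # shape (seq_len, seq_len)
    weights = np.exp(-0.5 * (dist / sigma) ** 2)
    if causal:
        # Assumption: decoder-style masking so position i ignores j > i.
        weights = np.tril(weights)
    # Assumption: renormalize each row to sum to 1, as softmax attention would.
    return weights / weights.sum(axis=1, keepdims=True)

def hard_coded_self_attention(values, offset=0, sigma=1.0, causal=False):
    """Mix value vectors (seq_len, d) with the fixed Gaussian weights."""
    weights = gaussian_attention(values.shape[0], offset, sigma, causal)
    return weights @ values

# Usage: three hypothetical "heads" focusing on the previous, current,
# and next token of a 6-token sequence with 4-dimensional values.
values = np.random.default_rng(0).normal(size=(6, 4))
heads = [hard_coded_self_attention(values, offset=o) for o in (-1, 0, 1)]
```

Because the weight matrix is a function of sequence length alone, it can be computed once and reused for every input, which is one way to see where savings in memory and decoding time could come from. Cross-attention has no such shortcut in this sketch, consistent with the finding that it benefits from at least one learned head.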
