Hard-Coded Gaussian Attention for Neural Machine Translation

Weiqiu You, Simeng Sun, Mohit Iyyer

TL;DR

The paper investigates whether the Transformer’s multi-headed attention is strictly necessary for high-quality neural machine translation. By introducing a parameter-free, hard-coded Gaussian attention variant, it shows that encoder and decoder self-attention can be replaced with fixed local distributions with little BLEU loss, while cross-attention remains critical. A single learned cross-attention head in the final decoder layer recovers most of the performance, suggesting that simpler attention mechanisms can achieve near-baseline translation quality. The findings imply avenues for more efficient, easier-to-implement attention models without substantial accuracy penalties, with practical benefits in memory use and decoding speed.

Abstract

Recent work has questioned the importance of the Transformer's multi-headed attention for achieving high translation quality. We push further in this direction by developing a "hard-coded" attention variant without any learned parameters. Surprisingly, replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs. However, additionally hard-coding cross attention (which connects the decoder to the encoder) significantly lowers BLEU, suggesting that it is more important than self-attention. Much of this BLEU drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer. Taken as a whole, our results offer insight into which components of the Transformer are actually important, which we hope will guide future work into the development of simpler and more efficient attention-based models.
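
As a rough illustration of what "fixed, input-agnostic Gaussian distributions" could look like in practice, the following is a minimal sketch rather than the authors' implementation. The function names, the unit standard deviation, the head offsets of -1, 0, and +1, the row-wise renormalization, and the causal masking option are all assumptions made for this example; the text above does not specify them.

```python
import math


def gaussian_attention_weights(seq_len, offset, sigma=1.0, causal=False):
    """Build a seq_len x seq_len matrix of fixed attention weights.

    Row i is a Gaussian over key positions j centered at i + offset.
    The weights depend only on positions, never on the input tokens.
    Assumption: each row is renormalized to sum to 1.
    """
    weights = []
    for i in range(seq_len):
        center = i + offset
        row = []
        for j in range(seq_len):
            if causal and j > i:
                # Assumed decoder-style mask: no attention to future positions.
                row.append(0.0)
            else:
                row.append(math.exp(-((j - center) ** 2) / (2 * sigma ** 2)))
        total = sum(row)  # always > 0 because position j = i is never masked
        weights.append([w / total for w in row])
    return weights


def apply_attention(weights, values):
    """Weighted average of value vectors: out[i] = sum_j weights[i][j] * values[j]."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]


def hard_coded_heads(values, offsets=(-1, 0, 1), sigma=1.0, causal=False):
    """Run one parameter-free head per offset; each head focuses on a different
    token in a local window around the query position (offsets are assumed)."""
    n = len(values)
    return [
        apply_attention(gaussian_attention_weights(n, o, sigma, causal), values)
        for o in offsets
    ]
```

Because the weight matrices depend only on sequence length, they contain no learned query or key projections and could be precomputed once per length, which is one plausible source of the simplicity the abstract points to.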

Paper Structure

This paper contains 32 sections, 2 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Three heads of learned self-attention (top) as well as our hard-coded attention (bottom) given the query word "to". In our variant, each attention head is a Gaussian distribution centered around a different token within a local window.
  • Figure 2: Most learned attention heads for a Transformer trained on IWSLT16 En-De focus on a local window around the query position. The x-axis plots each head of each layer, while the y-axis refers to the distance between the query position and the argmax of the attention head distribution (averaged across the entire dataset).
  • Figure 3: BLEU performance on WMT16 En-Ro before and after removing all feed-forward layers from the models. base and hc-sa achieve almost identical BLEU scores, but hc-sa relies more on the feed-forward layers than the vanilla Transformer. As shown on the plot, with a four layer encoder and decoder, the BLEU gap between base-ff and base is 1.8, while the gap between hc-sa and hc-sa-ff is 3.2.
  • Figure 4: BLEU difference vs. base as a function of reference length on the WMT14 En-De test set. When cross attention is hard-coded (hc-all), the BLEU gap worsens as reference length increases.
  • Figure 5: Hard-coded models become increasingly worse than base at subject-verb agreement as the dependency grows longer.
  • ...and 3 more figures
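
The locality measure described in the Figure 2 caption (distance between the query position and the argmax of a head's attention distribution, averaged over a dataset) can be sketched as follows. This is an illustrative reading of the caption only: the function name, the input layout (one attention matrix per sentence for a single head), and the use of absolute rather than signed distance are assumptions.

```python
def mean_argmax_distance(attention_matrices):
    """Average |argmax_j attn[i][j] - i| over all query positions.

    attention_matrices: list of matrices for one head, one per sentence;
    row i of a matrix is that head's distribution for query position i.
    Assumption: distance is taken as an absolute value.
    """
    total = 0
    count = 0
    for attn in attention_matrices:
        for i, row in enumerate(attn):
            # Index of the key position that receives the most attention.
            j = max(range(len(row)), key=row.__getitem__)
            total += abs(j - i)
            count += 1
    if count == 0:
        raise ValueError("no attention rows provided")
    return total / count
```

A value near zero or one for a given head would indicate that the head mostly attends within a small window around the query, which is the pattern the caption reports for most learned heads.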