Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

TL;DR

The Transformer paper addresses the inefficiency of recurrent/convolutional sequence transduction models by proposing a fully attention-based architecture that dispenses with recurrence and convolutions. It employs a six-layer encoder and six-layer decoder built from multi-head self-attention, encoder-decoder attention, and position-wise feed-forward networks, with shared embeddings and sinusoidal positional encoding to capture sequence order; this enables high parallelization and a constant path length between any two positions. The model achieves state-of-the-art BLEU on WMT 2014 English-German (28.4) and English-French (41.8) at substantially lower training cost, and generalizes to English constituency parsing with both large and limited training data. The work demonstrates that self-attention can outperform traditional recurrent/convolutional seq2seq approaches and introduces techniques for training efficiency, regularization, and cross-task transfer.
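The sinusoidal positional encoding mentioned above injects order information by adding fixed sine/cosine waves of geometrically spaced wavelengths to the token embeddings. A minimal NumPy sketch of the formula from the paper (PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)); the function name is our own, not from any official implementation:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings.

    Even columns hold sin(pos / 10000^(2i/d_model)),
    odd columns hold the matching cos, as in the Transformer paper.
    """
    pos = np.arange(max_len)[:, None]              # positions: (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dims: (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even indices
    pe[:, 1::2] = np.cos(angles)  # odd indices
    return pe
```

Because each dimension is a sinusoid of a different wavelength, relative offsets correspond to linear transformations of the encoding, which is the paper's stated motivation for this choice over learned embeddings.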

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Paper Structure

This paper contains 27 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The Transformer - model architecture.
  • Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
  • Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. Attentions here shown only for the word 'making'. Different colors represent different heads. Best viewed in color.
  • Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word 'its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.
  • Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.
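The scaled dot-product attention shown in Figure 2 can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code; the masking convention (boolean mask, True = attend) is our assumption:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v).
    mask: optional boolean array broadcastable to (..., n_q, n_k);
          False entries are blocked from attention.
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # block masked positions

    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients; multi-head attention simply runs several such layers in parallel on learned projections of Q, K, and V, as Figure 2 (right) depicts.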