Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

Vukasin Bozic, Danilo Dordevic, Daniele Coppola, Joseph Thommes, Sidak Pal Singh

TL;DR

The paper investigates whether shallow feed-forward (FF) networks can substitute for Transformer attention in sequence-to-sequence translation. It trains the FF replacements via knowledge distillation from a vanilla Transformer and evaluates four encoder self-attention replacement schemes, then extends the best approach to decoder self-attention and cross-attention. On IWSLT2017, encoder self-attention replacements achieve competitive performance, while cross-attention replacement remains a bottleneck. Full replacement also increases the parameter count and imposes a fixed maximum input length. The work suggests that attention is not strictly necessary for competitive performance and highlights optimization and cross-attention modeling as key directions for making attentionless Transformers practically viable.

Abstract

This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks.
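To make the distillation idea concrete, the following is a minimal toy sketch rather than the authors' implementation. It assumes a frozen teacher that is plain dot-product self-attention with identity projections, a student that is a one-hidden-layer ReLU network over the flattened sequence, and a mean-squared-error objective. All names (`teacher_attention`, `distill`, and so on) and all sizes are hypothetical and chosen only for illustration.

```python
import math
import random

# Toy sizes, assumed for illustration only (not taken from the paper).
SEQ_LEN, D_MODEL, HIDDEN = 3, 2, 16
N = SEQ_LEN * D_MODEL  # width of the flattened input and output


def teacher_attention(tokens):
    """Frozen teacher: dot-product self-attention with identity projections."""
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D_MODEL) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[d] for e, v in zip(exps, tokens)) for d in range(D_MODEL)])
    return out


def flatten(tokens):
    """Concatenate all token vectors into one flat list."""
    return [x for tok in tokens for x in tok]


def init_student(rng):
    """Shallow FF student: N -> HIDDEN (ReLU) -> N."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(HIDDEN)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(N)]
    return {"w1": w1, "b1": [0.0] * HIDDEN, "w2": w2, "b2": [0.0] * N}


def forward(p, x):
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(p["w1"], p["b1"])]
    h = [max(0.0, v) for v in pre]
    y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(p["w2"], p["b2"])]
    return h, y


def mse(y, t):
    return sum((a - b) ** 2 for a, b in zip(y, t)) / len(y)


def sgd_step(p, x, t, lr):
    """One SGD step on the MSE between student output and teacher output."""
    h, y = forward(p, x)
    dy = [2.0 * (a - b) / N for a, b in zip(y, t)]
    # Backpropagate into the hidden layer before w2 is modified.
    dh = [sum(dy[o] * p["w2"][o][j] for o in range(N)) if h[j] > 0 else 0.0
          for j in range(HIDDEN)]
    for o in range(N):
        for j in range(HIDDEN):
            p["w2"][o][j] -= lr * dy[o] * h[j]
        p["b2"][o] -= lr * dy[o]
    for j in range(HIDDEN):
        for i in range(N):
            p["w1"][j][i] -= lr * dh[j] * x[i]
        p["b1"][j] -= lr * dh[j]


def sample(rng):
    return [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(SEQ_LEN)]


def avg_loss(p, data):
    return sum(mse(forward(p, flatten(s))[1], flatten(teacher_attention(s)))
               for s in data) / len(data)


def distill(steps=3000, lr=0.05, seed=0):
    """Train the student on teacher outputs; return held-out loss before and after."""
    rng = random.Random(seed)
    student = init_student(rng)
    held_out = [sample(rng) for _ in range(50)]
    before = avg_loss(student, held_out)
    for _ in range(steps):
        s = sample(rng)
        sgd_step(student, flatten(s), flatten(teacher_attention(s)), lr)
    return before, avg_loss(student, held_out)


if __name__ == "__main__":
    before, after = distill()
    print(f"held-out MSE before: {before:.4f}, after: {after:.4f}")
```

The teacher is never updated; only the student's weights change, which is the essential property of the distillation setup described above. A real replacement would take its inputs and targets from the intermediate activations of a trained Transformer rather than from random vectors.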

Paper Structure (11 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Different encoder self-attention replacement approaches presented.
  • Figure 2: Relative BLEU scores [%] (relative to the baseline Transformer), depending on the FF network size. Encoder self-attention is replaced using different replacement methods.
  • Figure 3: Relative BLEU scores [%] (relative to the baseline), depending on the FF network size. The ALR method is used to replace different attention parts of the Transformer.
  • Figure 4: Illustration of the training and evaluation cycles of the ALRR method in the encoder self-attention. Replacement in the self-attention and cross-attention layer is analogous. Other replacement methods follow the same principle, with the difference that the input data and teacher labels are taken from the different blocks of the encoder, depending on their structure.
  • Figure 5: Illustration of the necessary data preprocessing and post-processing before and after propagation through the Feed-forward network.
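
The kind of preprocessing and post-processing that Figure 5 refers to can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes that a variable-length sequence is zero-padded to a fixed maximum length and flattened into one vector before entering the feed-forward network, and that the output is reshaped and trimmed afterwards. The function names and the zero padding value are hypothetical.

```python
PAD_VALUE = 0.0  # assumed padding value


def pad_and_flatten(tokens, max_len, d_model):
    """Pad a list of d_model-sized token vectors to max_len, then flatten."""
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than the fixed maximum length")
    padding = [[PAD_VALUE] * d_model for _ in range(max_len - len(tokens))]
    padded = [list(tok) for tok in tokens] + padding
    return [x for tok in padded for x in tok]


def unflatten_and_trim(flat, orig_len, d_model):
    """Reshape a flat vector into token vectors and drop the padded positions."""
    rows = [flat[i:i + d_model] for i in range(0, len(flat), d_model)]
    return rows[:orig_len]


if __name__ == "__main__":
    seq = [[1.0, 2.0], [3.0, 4.0]]
    flat = pad_and_flatten(seq, max_len=4, d_model=2)
    print(flat)
    print(unflatten_and_trim(flat, orig_len=len(seq), d_model=2))
```

Because the network's input width is `max_len * d_model`, any sequence longer than `max_len` cannot be processed, which is the fixed input length limitation noted in the TL;DR.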