Exclusive Self Attention

Shuangfei Zhai

TL;DR

This paper introduces exclusive self attention (XSA), a simple modification of self attention that improves the Transformer's sequence modeling performance, with gains that grow as sequence length increases.

Abstract

We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves the Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the token's own value vector (thus excluding information from the token's own position), which encourages better context modeling. Evaluated on the standard language modeling task, XSA consistently outperforms SA across model sizes up to 2.7B parameters, and its gains grow as sequence length increases.
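The orthogonality constraint described in the abstract can be illustrated with a small sketch. It assumes one particular reading of the idea: compute ordinary causal attention outputs $y_i$, then subtract the projection of $y_i$ onto the token's own value vector $v_i$, so that the result has no component along $v_i$. The function names, the single-head setup, the causal mask, and the `eps` stabilizer are assumptions made for illustration; this excerpt does not specify the paper's algorithm, which may differ.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal self attention: y_i = sum_{j<=i} a_ij * v_j."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        scores = [dot(q[i], k[j]) / math.sqrt(d) for j in range(i + 1)]
        a = softmax(scores)
        out.append([sum(a[j] * v[j][c] for j in range(i + 1))
                    for c in range(len(v[0]))])
    return out

def exclusive_attention(q, k, v, eps=1e-12):
    """Assumed XSA-style variant (hypothetical helper, not from the paper):
    remove from each output y_i its projection onto the self value v_i."""
    y = causal_attention(q, k, v)
    z = []
    for y_i, v_i in zip(y, v):
        # Projection coefficient of y_i onto v_i; eps guards against v_i = 0.
        coef = dot(y_i, v_i) / (dot(v_i, v_i) + eps)
        z.append([yc - coef * vc for yc, vc in zip(y_i, v_i)])
    return z

# Toy inputs: 3 tokens, head dimension 2 (values chosen arbitrarily).
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.5]]

z = exclusive_attention(q, k, v)
for z_i, v_i in zip(z, v):
    # Each printed value should be numerically close to zero.
    print(dot(z_i, v_i))
```

Under this reading, the first token's output is (numerically) the zero vector, because a causal model's first position can only attend to itself and its output is therefore entirely along $v_1$. Whether the paper treats that position specially is not stated in this excerpt.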

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, 2 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Visualization of the attention similarity bias of a 1.3B parameter language model of sequence length 2048 trained for 100B tokens, aggregated on 1024 random training sequences. Left: the average cosine similarity of value vectors $v_i$, $v_j$ within a sequence; middle: the average diagonal attention value $a_{i,j}$; right: the average cosine similarity of attention output $y_i$ and the self value vector $v_i$. See Eq. \ref{eq:sa} for notation.
  • Figure 2: Time and memory efficiency of XSA compared to standard attention. XSA introduces minimal computational overhead across various sequence lengths and model sizes $d_{model}$.
  • Figure 3: Training and validation loss curves of XSA against the baseline Transformer for three model sizes.
  • Figure 4: Training and validation loss of XSA against the baseline Transformer for various learning rates evaluated with the 1.3B model.
  • Figure 5: Training and validation loss of XSA against the baseline Transformer for various sequence lengths evaluated with the 1.3B model.
  • ...and 1 more figures