Generalized Probabilistic Attention Mechanism in Transformers

DongNyeong Heo, Heeyoul Choi

TL;DR

This paper proposes a generalized probabilistic attention mechanism (GPAM) and its dual-attention implementation (daGPAM) within the Transformer architecture. It empirically validates the theoretical advantages of daGPAM and demonstrates its superiority over alternative attention mechanisms that were proposed to address the same issues.

Abstract

The Transformer architecture has become widely adopted due to its demonstrated success, which is attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis showing that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanisms, referred to as the generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the rank-collapse and gradient vanishing issues, which are difficult to resolve simultaneously with conventional attention mechanisms. Furthermore, we empirically validate this theoretical evidence, demonstrating the superiority of daGPAM over alternative attention mechanisms that were proposed to address the same issues. Additionally, we demonstrate the practical benefits of GPAM in natural language processing tasks, such as language modeling and neural machine translation.
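The property stated in the abstract (negative attention scores with a fixed total sum) can be illustrated with a small sketch. The parameterization below, `(1 + lam) * p_plus - lam * p_minus` with a single hypothetical coefficient `lam`, is an assumption chosen for illustration and is not taken from the paper; it only shows how combining two softmax rows can give weights that still sum to 1 while individual entries go negative, so the output is an affine rather than a convex combination of the value vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affine_attention_row(pos_scores, neg_scores, lam):
    """Combine two softmax rows as (1 + lam) * p_plus - lam * p_minus.

    Hypothetical parameterization (not the paper's formula): the weights
    sum to 1 for any lam, but single entries may be negative when lam > 0.
    """
    p_plus = softmax(pos_scores)
    p_minus = softmax(neg_scores)
    return [(1.0 + lam) * a - lam * b for a, b in zip(p_plus, p_minus)]

def combine(weights, values):
    """Weighted sum of value vectors given as lists of floats."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

# Three value vectors: the origin and the two unit vectors in 2-D.
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Conventional attention: non-negative weights, output stays in the convex hull.
convex = softmax([2.0, 0.0, -1.0])
y_convex = combine(convex, values)

# Dual-attention style row: same total sum, but one weight is negative,
# so the output can leave the convex hull of the value vectors.
affine = affine_attention_row([2.0, 0.0, -1.0], [-1.0, 0.0, 2.0], lam=0.5)
y_affine = combine(affine, values)

print("convex weights:", convex, "sum =", sum(convex))
print("affine weights:", affine, "sum =", sum(affine))
print("y_convex:", y_convex)
print("y_affine:", y_affine)
```

In this toy setup the second coordinate of `y_affine` is negative, which no convex combination of the three value vectors can produce.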

Paper Structure

This paper contains 37 sections, 9 theorems, 37 equations, 8 figures, 6 tables.

Key Result

Lemma 1

For any single scaled dot-product self-attention layer with a term $\gamma$ that depends on the attention entries, the composite norm of the output residual is bounded by where $\mathbf{W}_{QK}=\mathbf{W}_Q\mathbf{W}_K^\top$. In the region where $4\sqrt{2}\gamma\|\mathbf{W}_{QK}\|_{1}\|\mathbf{W}_{V}\|_{1,\infty}<\sqrt{d_{qk}}$ holds, the output residual norm is diminished compared to the cubic rate o
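The quantity this lemma bounds can be made concrete with a short sketch. It computes a residual and its composite norm $\sqrt{\|\cdot\|_1\|\cdot\|_\infty}$ for a toy matrix, then applies two hand-built row-stochastic attention matrices. Two assumptions are made here: the column-wise mean row stands in for the minimizing rank-1 approximation, and the attention matrices are fixed by hand rather than computed from queries and keys. The sketch illustrates how convex averaging shrinks the residual; it does not reproduce the bound itself.

```python
def column_mean_residual(x):
    """Return X - 1 * mean^T: each row minus the column-wise mean row.

    Assumption: the mean row is a simple stand-in for the minimizing
    rank-1 approximation used in rank-collapse analyses.
    """
    n, d = len(x), len(x[0])
    mean = [sum(row[k] for row in x) / n for k in range(d)]
    return [[row[k] - mean[k] for k in range(d)] for row in x]

def composite_norm(m):
    """sqrt(||M||_1 * ||M||_inf) with induced matrix norms.

    ||M||_1 is the largest absolute column sum and ||M||_inf is the
    largest absolute row sum.
    """
    norm_1 = max(sum(abs(row[k]) for row in m) for k in range(len(m[0])))
    norm_inf = max(sum(abs(v) for v in row) for row in m)
    return (norm_1 * norm_inf) ** 0.5

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]

# Toy token representations (3 tokens, 2 features) with zero column mean.
x = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Uniform attention: every token averages over all tokens.
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]

# Half identity, half uniform: a milder row-stochastic mixing.
mixed = [[0.5 * (1.0 if i == j else 0.0) + 0.5 / 3.0 for j in range(3)]
         for i in range(3)]

res_input = composite_norm(column_mean_residual(x))
res_uniform = composite_norm(column_mean_residual(matmul(uniform, x)))
res_mixed = composite_norm(column_mean_residual(matmul(mixed, x)))

print("input residual norm:  ", res_input)
print("after mixed attention:", res_mixed)
print("after uniform attention:", res_uniform)
```

With non-negative weights that sum to 1, each output row is an average of input rows, so in this example the residual norm can only stay the same or shrink from layer to layer.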

Figures (8)

  • Figure 1: Examples of convex and affine combinations $\mathbf{Y}^c$ and $\mathbf{Y}^a$, given $\mathbf{X}$ value representations.
  • Figure 2: Our proposed daGPAM structure in an example of two multi-head self-attention in Transformer.
  • Figure 3: Output spaces and representations for different $\lambda$ combinations. $\lambda^{-}$ varies while $\lambda^{+}$ is fixed to 1.
  • Figure 4: Results of the rank-collapse analyses (left two graphs) and gradient histories during training (right two graphs). The horizontal axes of the rank-collapse analyses indicate the layer index, while those of the gradient histories indicate training iterations.
  • Figure 5: Results of the faithfulness test (rank-collapse analysis at initialization), varying $\lambda^{+}$ with $\lambda^{-}$ fixed to 1.
  • ...and 3 more figures

Theorems & Definitions (12)

  • Lemma 1: dong2021attention, Simplified
  • Lemma 2: Gradient Vanishing in Attention Mechanism
  • Lemma 3: Maximum Total Norm of Gradients
  • Lemma 4: Dual-Attention GPAM residual Bound, Simplified
  • Lemma 5: Dual-Attention GPAM Gradients
  • Lemma 6: dong2021attention
  • Lemma 7: Maximum Total Norm of Gradients
  • proof
  • Lemma 8: Dual-Attention GPAM residual Bound, Completed
  • proof
  • ...and 2 more