
Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

Rachel S. Y. Teo, Tan M. Nguyen

TL;DR

This work derives self-attention from kernel principal component analysis, shows that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space, and gives an exact formula for the value matrix in self-attention.

Abstract

The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms relies on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.


Paper Structure

This paper contains 43 sections, 1 theorem, 36 equations, 5 figures, 14 tables, 2 algorithms.

Key Result

Theorem 1

Given a set $M$ of key vectors, $M := \{{\bm{k}}_1,\dots,{\bm{k}}_N\}\subset {\mathbb{R}}^{D}$, a kernel $k({\bm{x}}, {\bm{y}}) := \exp({\bm{x}}^{\top}{\bm{y}}/\sqrt{D})$, and a vector-scalar function $g({\bm{x}}) := \sum_{j=1}^{N} k({\bm{x}},{\bm{k}}_j)$, self-attention performs kernel PCA and projects [...] The feature space ${\bm{\varphi}}$ is induced by the kernel $k_{{\bm{\varphi}}}({\bm{x}}, {\bm{y}})$ [...]
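
One identity follows directly from the definitions of $k$ and $g$ in the statement above: the softmax attention weight that a query ${\bm{q}}$ places on key ${\bm{k}}_j$ equals $k({\bm{q}},{\bm{k}}_j)/g({\bm{q}})$. The sketch below checks this numerically in plain Python. It is only an illustration under stated assumptions: the function names, the toy dimensions, and the random data are hypothetical, and it does not reproduce the kernel PCA projection itself.

```python
import math
import random

def dot(x, y):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))

def kernel(x, y):
    """k(x, y) = exp(x^T y / sqrt(D)), as defined in the theorem statement."""
    return math.exp(dot(x, y) / math.sqrt(len(x)))

def g(x, keys):
    """g(x) = sum_j k(x, k_j), the normalizer from the theorem statement."""
    return sum(kernel(x, k) for k in keys)

def softmax_weights(q, keys):
    """Standard scaled dot-product softmax weights of q over the keys."""
    logits = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (assumption: small random keys and a single random query).
random.seed(0)
D, N = 4, 5
keys = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
q = [random.gauss(0.0, 1.0) for _ in range(D)]

# Weights written with the theorem's k and g versus the usual softmax.
kernel_form = [kernel(q, k) / g(q, keys) for k in keys]
softmax_form = softmax_weights(q, keys)

max_gap = max(abs(a - b) for a, b in zip(kernel_form, softmax_form))
print("largest difference between the two forms:", max_gap)
```

The two forms agree up to floating-point error, which is why the theorem can state softmax attention in terms of the kernel $k$ and the normalizer $g$.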

Figures (5)

  • Figure 1: Projection loss vs. training epochs of ViT-tiny model. The reconstruction loss is averaged over the batch, heads, and layers. The downward trend suggests that the model is implicitly minimizing this projection loss.
  • Figure 2: Mean and standard deviation of the absolute differences of elements in the constant vector $\mathbf{1}\lambda_d$, $d=1,\dots,D_v$. The mean should be $0$ with small standard deviations when $v_{dj}$ are close to the values predicted in Theorem 1. For comparison, we observe that the max, min, mean, and median of the absolute values of all the eigenvalues, averaged over all attention heads and layers, are 648.46, 4.65, 40.07, and 17.73, respectively, which are much greater than the values of $|\gamma_i - \gamma_j|$.
  • Figure 3: Left: Top-1 accuracy of RPC-SymViT vs. baseline SymViT evaluated on PGD/FGSM attacked ImageNet-1K validation set across increasing perturbation budgets. Right: Validation top-1 accuracy (%) and loss of Scaled Attention vs. the baseline asymmetric softmax attention in ViT for the first 50 training epochs.
  • Figure 4: Plot of the validation top-1 accuracy (%) and loss on a log scale of the baseline asymmetric attention ViT and two variants with the parameterization of Remark 3. The curves are plotted for the full training time and show ${\bm{S}}$ trained as a matrix parameter as well as a scalar parameter scaling a symmetric attention matrix.
  • Figure 5: Plot of the mean and standard deviation of the differences in coordinate values of constant vector $\mathbf{1}\lambda_d$ for $d=1,\dots,D_v$ for all 12 layers of a ViT-tiny model. The mean should be $0$ with small standard deviations when $v_{dj}\approx \frac{a_{dj}}{g({\bm{k}}_j)} - \frac{1}{N}\sum_{j'=1}^N\frac{a_{dj'}}{g({\bm{k}}_{j'})}$.
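
The expression for $v_{dj}$ in the Figure 5 caption can be read as a centering step: each coefficient $a_{dj}$ is scaled by $1/g({\bm{k}}_j)$, and the average of those scaled terms over all keys is subtracted. The sketch below works through that arithmetic in plain Python. It rests on assumptions: the subtracted term is taken to be the mean over $j'$ of $a_{dj'}/g({\bm{k}}_{j'})$, the coefficients `a_d` are made-up inputs (this excerpt does not say how they are obtained), and all function names are hypothetical.

```python
import math

def kernel(x, y):
    """k(x, y) = exp(x^T y / sqrt(D)), as in Theorem 1."""
    return math.exp(sum(a * b for a, b in zip(x, y)) / math.sqrt(len(x)))

def g(x, keys):
    """g(x) = sum_j k(x, k_j), as in Theorem 1."""
    return sum(kernel(x, k) for k in keys)

def value_row(a_d, keys):
    """Return [v_d1, ..., v_dN] with v_dj = a_dj / g(k_j) - mean_j'(a_dj' / g(k_j'))."""
    scaled = [a / g(k, keys) for a, k in zip(a_d, keys)]
    mean = sum(scaled) / len(scaled)
    return [s - mean for s in scaled]

# Toy inputs (assumption: three 2-dimensional keys, arbitrary coefficients).
keys = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
a_d = [0.2, -0.7, 0.4]  # hypothetical coefficients for one value dimension d

v_d = value_row(a_d, keys)
print("v_d:", v_d)
print("sum over keys:", sum(v_d))
```

Because a mean is subtracted, the entries of `v_d` sum to zero over the keys (up to floating-point error), which is a quick sanity check on any implementation of this formula.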

Theorems & Definitions (5)

  • Theorem 1: Softmax Attention as Principal Component Projections
  • Remark 1: Calculating the Gram Matrix $\widetilde{{\bm{K}}}_{{\bm{\varphi}}}$
  • Remark 2: Determining $D_v$
  • Remark 3: Parameterization of the Value Matrix ${\bm{V}}$
  • Definition 1: Attention with Robust Principal Components