Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

Rachel S. Y. Teo; Tan M. Nguyen

TL;DR

This work derives self-attention from kernel principal component analysis (kernel PCA) and shows that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. It also derives the exact formula for the value matrix in self-attention.

Abstract

The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. As with most deep learning models, the construction of these attention mechanisms relies on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then derive the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.

Paper Structure (43 sections, 1 theorem, 36 equations, 5 figures, 14 tables, 2 algorithms)

Introduction
Principal Component Analysis of Attention
Deriving Attention from Kernel PCA
Analysis on the Convergence of Self-Attention Layers to Kernel PCA
Projection Error Minimization
Learning Eigenvectors of $\widetilde{{\bm{K}}}_{{\bm{\varphi}}}$ in Eqn. (\ref{eqn:eig_K})
Robust Softmax Attention
Principal Component Pursuit
Attention with Robust Principal Components
Experimental Results
Vision Tasks: ImageNet-1K Object Classification
Language Tasks: WikiText-103 Language Modeling
Validating the Benefits of Scaled Attention
Related Works
Concluding Remarks
...and 28 more sections

Key Result

Theorem 1

Given a set $M$ of key vectors, $M := \{{\bm{k}}_1,\dots,{\bm{k}}_N\}\subset {\mathbb{R}}^{D}$, a kernel $k({\bm{x}}, {\bm{y}}) := \exp({\bm{x}}^{\top}{\bm{y}}/\sqrt{D})$, and a vector-scalar function $g({\bm{x}}) := \sum_{j=1}^{N} k({\bm{x}},{\bm{k}}_j)$, self-attention performs kernel PCA and projects its query vectors onto the principal component axes of the key vectors in $M$ in a feature space ${\bm{\varphi}}$. The feature space ${\bm{\varphi}}$ is induced by the kernel $k_{{\bm{\varphi}}}({\bm{x}}, {\bm{y}})$ ...
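The kernel $k$ and the normalizer $g$ defined in Theorem 1 are enough to reproduce the usual softmax attention weights, since $\mathrm{softmax}_j({\bm{q}}^{\top}{\bm{k}}_j/\sqrt{D}) = k({\bm{q}},{\bm{k}}_j)/g({\bm{q}})$. The following is a minimal NumPy sketch of that identity, not the authors' implementation; the function names, array shapes, and random inputs are illustrative assumptions.

```python
import numpy as np

def kernel(X, Y):
    # k(x, y) = exp(x^T y / sqrt(D)) for every row pair of X and Y (Theorem 1).
    D = X.shape[1]
    return np.exp(X @ Y.T / np.sqrt(D))

def kernel_attention_weights(Q, K):
    # Weight on key j for query i: k(q_i, k_j) / g(q_i),
    # where g(q_i) = sum_j k(q_i, k_j).
    k_qk = kernel(Q, K)                    # shape (num_queries, N)
    g_q = k_qk.sum(axis=1, keepdims=True)  # shape (num_queries, 1)
    return k_qk / g_q

def softmax_attention_weights(Q, K):
    # Standard scaled dot-product softmax, with max-subtraction for stability.
    D = Q.shape[1]
    scores = Q @ K.T / np.sqrt(D)
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical toy data: 4 queries and N = 6 keys in D = 8 dimensions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))

W_kernel = kernel_attention_weights(Q, K)
W_softmax = softmax_attention_weights(Q, K)
print(np.allclose(W_kernel, W_softmax))
```

The check only confirms that the kernel-and-normalizer form matches softmax attention weights; it does not perform the kernel PCA projection itself.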

Figures (5)

Figure 1: Projection loss vs. training epochs of ViT-tiny model. The reconstruction loss is averaged over the batch, heads, and layers. The downward trend suggests that the model is implicitly minimizing this projection loss.
Figure 2: Mean and standard deviation of the absolute differences of elements in the constant vector $\mathbf{1}\lambda_d$, $d=1,\dots,D_v$. The mean should be $0$ with small standard deviations when $v_{dj}$ are close to the values predicted in Theorem 1. For comparison, we observe that the max, min, mean, and median of the absolute values of all the eigenvalues, averaged over all attention heads and layers, are 648.46, 4.65, 40.07, and 17.73, respectively, which are much greater than the values of $|\gamma_i - \gamma_j|$.
Figure 3: Left: Top-1 accuracy of RPC-SymViT vs. baseline SymViT evaluated on PGD/FGSM attacked ImageNet-1K validation set across increasing perturbation budgets. Right: Validation top-1 accuracy (%) and loss of Scaled Attention vs. the baseline asymmetric softmax attention in ViT for the first 50 training epochs.
Figure 4: Plot of the validation top-1 accuracy (%) and loss on a log scale of the baseline asymmetric attention ViT and two variants with the parameterization of Remark 3. The curves are plotted for the full training time and show ${\bm{S}}$ trained as a matrix parameter as well as a scalar parameter scaling a symmetric attention matrix.
Figure 5: Plot of the mean and standard deviation of the differences in coordinate values of constant vector $\mathbf{1}\lambda_d$ for $d=1,\dots,D_v$ for all 12 layers of a ViT-tiny model. The mean should be $0$ with small standard deviations when $v_{dj}\approx \frac{a_{dj}}{g({\bm{k}}_j)} - \frac{1}{N}\sum_{j'=1}^N\frac{a_{dj'}}{g({\bm{k}}_{j})}$.
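The value formula quoted in the Figure 5 caption can be evaluated directly once the keys and the coefficients $a_{dj}$ are available. The sketch below is an illustration under stated assumptions, not the authors' code: the array `A` standing in for the coefficients $a_{dj}$ and the keys `K` are random placeholder data, and the function names are hypothetical.

```python
import numpy as np

def g_of_keys(K):
    # g(k_j) = sum_{j'} exp(k_j^T k_{j'} / sqrt(D)), one value per key.
    D = K.shape[1]
    return np.exp(K @ K.T / np.sqrt(D)).sum(axis=1)  # shape (N,)

def predicted_values(A, K):
    # v_dj = a_dj / g(k_j) - (1/N) * sum_{j'} a_dj' / g(k_j),
    # following the formula in the Figure 5 caption term by term.
    N = K.shape[0]
    g_k = g_of_keys(K)                                   # shape (N,)
    first = A / g_k                                      # a_dj / g(k_j)
    second = A.sum(axis=1, keepdims=True) / (N * g_k)    # (1/N) sum_j' a_dj' / g(k_j)
    return first - second                                # shape (D_v, N)

# Hypothetical toy data: N = 5 keys in D = 8 dimensions, D_v = 3.
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 8))
A = rng.normal(size=(3, 5))   # placeholder for the coefficients a_dj

V = predicted_values(A, K)
print(V.shape)
```

Because both terms share the denominator $g({\bm{k}}_j)$, the same quantity can be written as the row-centered coefficients divided by $g({\bm{k}}_j)$, which is a convenient sanity check on the implementation.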

Theorems & Definitions (5)

Theorem 1: Softmax Attention as Principal Component Projections
Remark 1: Calculating the Gram Matrix $\widetilde{{\bm{K}}}_{{\bm{\varphi}}}$
Remark 2: Determining $D_v$
Remark 3: Parameterization of the Value Matrix ${\bm{V}}$
Definition 1: Attention with Robust Principal Components
