Spectral Conditioning of Attention Improves Transformer Performance

Hemanth Saratchandran; Simon Lucey

TL;DR

A method is introduced that systematically alters the spectral properties of each attention layer to reduce the Jacobian's condition number, thereby improving the overall conditioning of the attention layers within a transformer network.

Abstract

We present a theoretical analysis of the Jacobian of an attention block within a transformer, showing that it is governed by the query, key, and value projections that define the attention mechanism. Leveraging this insight, we introduce a method that systematically alters the spectral properties of each attention layer to reduce the Jacobian's condition number, thereby improving the overall conditioning of the attention layers within a transformer network. We empirically show that this improved Jacobian conditioning translates to enhanced performance in practice. Our approach is simple, broadly applicable, and can be easily integrated as a drop-in replacement for a wide range of existing attention mechanisms. We validate its effectiveness across diverse transformer architectures and tasks, demonstrating consistent improvements in performance.
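
The idea of improving conditioning by adding a fixed correction term to a projection matrix can be illustrated numerically. The sketch below is not the authors' construction: the paper derives its correction terms from a theorem that is not reproduced here, so the helper `add_fixed_correction` and its choice of a scaled identity, $C = \lambda I$, are assumptions made purely for illustration. It only shows that adding a fixed, well-conditioned matrix to a small random weight matrix can raise the minimum singular value and lower the condition number.

```python
import numpy as np


def condition_number(w):
    """Ratio of the largest to the smallest singular value of w."""
    s = np.linalg.svd(w, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]


def add_fixed_correction(w, lam):
    """Hypothetical correction (an assumption, not the paper's formula).

    Returns (w + c, c) with c = lam * I, where I has the same shape as w.
    """
    c = lam * np.eye(*w.shape)
    return w + c, c


rng = np.random.default_rng(0)
d = 64

# Stand-in for a query projection W_Q with a small-scale random init.
w_q = rng.normal(scale=0.02, size=(d, d))

# The correction is computed once and would then be kept fixed.
w_q_cond, c_q = add_fixed_correction(w_q, lam=1.0)

cond_before = condition_number(w_q)
cond_after = condition_number(w_q_cond)
print(f"condition number before: {cond_before:.2f}")
print(f"condition number after:  {cond_after:.2f}")
```

In this toy setting the perturbation $W_Q$ has spectral norm well below 1, so the singular values of $W_Q + I$ stay close to 1 and the condition number drops sharply. Whether a scaled identity is a good choice in a real transformer is not something this sketch establishes.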

Paper Structure (51 sections, 7 theorems, 43 equations, 21 figures, 12 tables)

Introduction
Related Work
Conditioning.
Attention.
Methodology
Preliminaries
Main Theorems
Overview of proof of \ref{thm:attention_weights_regularize}:
Spectral Conditioned Attention
Implementation:
Experiments
Implementation.
Image Classification
Vision transformers.
Validating the theory.
...and 36 more sections

Key Result

Lemma 3.2

Let $\Lambda : \mathbb{R}^n \rightarrow \mathbb{R}^{n\times n}$ denote the function $\Lambda(z) = \operatorname{Diag}(z) - z z^T$. We then have that
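
Separately from the lemma's own conclusion, a standard fact about this map is that $\Lambda(\mathrm{softmax}(x))$ equals the Jacobian of softmax at $x$, which is presumably why it appears in an analysis of the attention Jacobian. The sketch below checks that fact numerically with central finite differences. The function names (`softmax`, `lambda_map`, `numerical_jacobian`) and the test vector are illustrative choices, not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def lambda_map(z):
    """Lambda(z) = Diag(z) - z z^T, as defined in the lemma."""
    return np.diag(z) - np.outer(z, z)


def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f: R^n -> R^n at x."""
    n = x.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        # Column j holds the partial derivatives with respect to x[j].
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac


x = np.array([0.5, -1.0, 2.0, 0.0])
analytic = lambda_map(softmax(x))
numeric = numerical_jacobian(softmax, x)
max_err = np.max(np.abs(analytic - numeric))
print(f"max abs difference: {max_err:.2e}")
```

Because the softmax outputs sum to one, every row of $\Lambda(\mathrm{softmax}(x))$ sums to zero, so this matrix is always singular.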

Figures (21)

Figure 1: An illustration of spectrally conditioned self-attention within a transformer layer. At each layer, the self-attention weights $W_Q$, $W_K$, and $W_V$ are modified by adding correction terms $C_Q$, $C_K$, and $C_V$, respectively. The correction terms $C_Q$, $C_K$, and $C_V$ are initialized before training using \ref{thm:imp_friendly} and remain fixed throughout training.
Figure 2: Analysis for ViT-B. Left: Average minimum singular value of the query, key, and value projection matrices ($W_Q$, $W_K$, $W_V$) and their spectrally conditioned counterparts ($W_Q + C_Q$, $W_K + C_K$, $W_V + C_V$) throughout training. Middle: Condition numbers of $W_Q$, $W_K$, and $W_V$, and their spectrally conditioned forms during training. Right: Average condition number of the self-attention Jacobian over the course of training, before and after spectral conditioning, along with the theoretical bound from \ref{eqn:cond_jac}.
Figure 3: Analysis for XCiT-M. Left: Average minimum singular value of the query, key, and value projection matrices ($W_Q$, $W_K$, $W_V$) and their spectrally conditioned counterparts ($W_Q + C_Q$, $W_K + C_K$, $W_V + C_V$) throughout training. Middle: Condition numbers of $W_Q$, $W_K$, and $W_V$, and their spectrally conditioned forms during training. Right: Average condition number of the self-attention Jacobian over the course of training, before and after spectral conditioning, along with the theoretical bound from \ref{eqn:cond_jac}.
Figure 4: Analysis for Nyströmformer on a text classification task. Left: Average minimum singular value of the query, key, and value projection matrices ($W_Q$, $W_K$, $W_V$) and their spectrally conditioned counterparts ($W_Q + C_Q$, $W_K + C_K$, $W_V + C_V$) throughout training. Middle: Condition numbers of $W_Q$, $W_K$, and $W_V$, and their spectrally conditioned forms during training. Right: Average condition number of the attention Jacobian over the course of training, before and after spectral conditioning, along with the theoretical bound from \ref{eqn:cond_jac}.
Figure 5: Left: Average minimum singular value of the query, key, and value projection matrices ($W_Q$, $W_K$, $W_V$) for a ViT-B during training. We plot the mean over five trials and the standard deviation. Right: Average minimum singular value of the corrected query, key, and value projection matrices ($W_Q + C_Q$, $W_K + C_K$, $W_V + C_V$) for a ViT-B during training. We plot the mean over five trials and the standard deviation.
...and 16 more figures

Theorems & Definitions (17)

Definition 3.1
Lemma 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Definition 3.6
Remark 3.7
Theorem 3.8
Lemma A.1
proof
...and 7 more
