Transformers as Support Vector Machines

Davoud Ataee Tarzanagh; Yingcong Li; Christos Thrampoulidis; Samet Oymak

TL;DR

This work formalizes a deep connection between transformer self-attention and hard-margin SVMs by showing that optimizing the attention mechanism corresponds to solving max-margin problems over token pairs. It differentiates the implicit biases of the two common parameterizations: Frobenius-norm bias for W-parameterization and nuclear-norm bias for KQ-factorization, with overparameterization enabling global convergence to the max-margin directions. The authors prove global convergence under favorable initial gradients and via overparameterization, and they establish local convergence results that lead to locally-optimal token selections; they further extend the SVM equivalence to nonlinear prediction heads and sequential/causal prediction settings. Empirical results validate the theoretical predictions and illustrate scenarios where attention selects a single token or multiple tokens, offering a principled, SVM-based lens to interpret transformer optimization and generalization.
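For orientation, the two max-margin problems contrasted above can be sketched as follows. This is a reconstruction under stated assumptions rather than a quotation of the paper: it borrows the symbols that appear in the Key Result and figure captions ($\bm{x}_{it}$ is the $t$-th token of the $i$-th input sequence ${\bm{X}}_i$, ${\bm{z}}_i$ is the query token, and $\texttt{opt}_i$ is the index of the optimal token), and the unit margin is the usual hard-margin normalization. The exact statement should be checked against the full text.

$$
\bm{W}^\textsl{mm} = \arg\min_{\bm{W}} \|\bm{W}\|_F
\quad \text{subject to} \quad
(\bm{x}_{i\texttt{opt}_i} - \bm{x}_{it})^\top \bm{W} {\bm{z}}_i \geq 1
\quad \text{for all } t \neq \texttt{opt}_i,\ i = 1, \dots, n.
$$

$$
\bm{W}^\textsl{mm}_\star \in \arg\min_{\bm{W}} \|\bm{W}\|_\star
\quad \text{subject to the same constraints.}
$$

Each constraint is linear in $\bm{W}$ because $(\bm{x}_{i\texttt{opt}_i} - \bm{x}_{it})^\top \bm{W} {\bm{z}}_i = \langle \bm{W},\ (\bm{x}_{i\texttt{opt}_i} - \bm{x}_{it}) {\bm{z}}_i^\top \rangle$, which is the sense in which the separation acts on outer products of token pairs. The Frobenius-norm problem corresponds to the $\bm{W}$-parameterization and the nuclear-norm problem to the $({\bm{K}},{\bm{Q}})$-factorization.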

Abstract

Since its inception in "Attention Is All You Need", the transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as $\mathrm{softmax}(XQK^\top X^\top)$, where $(K,Q)$ are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer with vanishing regularization, parameterized by $(K,Q)$, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W=KQ^\top$. In contrast, directly parameterizing by $W$ minimizes a Frobenius norm objective. We characterize this convergence, highlighting that it can occur toward locally-optimal directions rather than global ones. (2) Complementing this, we prove the local/global directional convergence of gradient descent under suitable geometric conditions. Importantly, we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points. (3) While our theory applies primarily to linear prediction heads, we propose a more general SVM equivalence that predicts the implicit bias with nonlinear heads. Our findings are applicable to arbitrary datasets and their validity is verified via experiments. We also introduce several open problems and research directions. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
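The link between separation and token selection can be illustrated with a minimal sketch. It is not the authors' code: the toy tokens, the query, the choice of the identity matrix as a feasible direction, and the helper names (`attention_probs`, `margins`, `scale`) are all assumptions made for illustration. The sketch checks that a direction $W$ satisfying the margin constraints $(x_{\mathrm{opt}} - x_t)^\top W z \geq 1$ makes the softmax attention concentrate on the optimal token as $W$ is scaled up, which is why only the direction of $W$ matters in the limit.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def scores(X, W, z):
    """Attention scores x_t^T W z for every token x_t in X."""
    d = len(z)
    Wz = [sum(W[r][c] * z[c] for c in range(d)) for r in range(d)]
    return [sum(x[r] * Wz[r] for r in range(d)) for x in X]


def attention_probs(X, W, z):
    """Softmax attention weights that the query z places on the tokens of X."""
    return softmax(scores(X, W, z))


def margins(X, W, z, opt):
    """Margins (x_opt - x_t)^T W z for all t != opt."""
    s = scores(X, W, z)
    return [s[opt] - s[t] for t in range(len(X)) if t != opt]


def scale(W, R):
    """Return R * W."""
    return [[R * w for w in row] for row in W]


# Hypothetical toy data: T = 3 tokens in d = 2 dimensions, one query token.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z = [1.0, 0.0]
opt = 0  # index of the token assumed to be optimal

# A feasible (not necessarily minimum-norm) separating direction.
W = [[1.0, 0.0], [0.0, 1.0]]

for R in (1.0, 5.0, 20.0):
    probs = attention_probs(X, scale(W, R), z)
    print(R, [round(p, 6) for p in probs])
```

The identity matrix is used only because it is easy to verify by hand that it meets the constraints on this toy input; finding the minimum-norm feasible $W$ would require a convex solver, which is deliberately left out here.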

Paper Structure (43 sections, 27 theorems, 218 equations, 13 figures)

Introduction
Preliminaries
Optimal tokens and hard-margin SVM problem for cross attention
Understanding the Implicit Bias of Self-Attention
Global Convergence of Gradient Descent
Properties of optimization landscape
Provable global convergence of 1-layer transformer
(I) Global convergence under good initial gradient.
(II) Global convergence via overparameterization.
Understanding Local Convergence of 1-Layer Transformer
Local convergence of gradient descent
Overparameterization conjecture: When local-optimal directions disappear
Investigation on SVM objectives and GD convergence
Guarantees on local regularization path
Toward A More General SVM Equivalence for Nonlinear Prediction Heads
...and 28 more sections

Key Result

Theorem 1

Suppose $d\geq \max(T-1,n)$. Then, almost all datasets $(Y_i,{\bm{X}}_i,{\bm{z}}_i)_{i=1}^n$ -- including the self-attention setting with ${\bm{z}}_i\gets\bm{x}_{i1}$ -- obey the following: \ref{eqn:sattnsvm} is feasible, i.e., $\bm{W}^\textsl{mm}$ separates the desired tokens $\texttt{opt}=(\texttt{opt}_i

Figures (13)

Figure 1: GD convergence during training of cross-attention weight $\bm{W}$ or $({\bm{K}},{\bm{Q}})$ with data. Teal and yellow markers represent tokens from ${\bm{X}}_1$ and ${\bm{X}}_2$, while stars mark optimal tokens. Solid lines in Figures (a) and (b) depict the \ref{eqn:sattnsvm} and \ref{eqn:sattnsvmst} directions mapped to ${\bm{z}}_1$ (red) and ${\bm{z}}_2$ (blue), respectively. Arrows illustrate GD trajectories converging towards these SVM directions. Red and blue dotted lines represent the corresponding separating hyperplanes.
Figure 2: Percentage of different convergence types when training cross-attention weights ($\bm{W}$) using GD and varying dimension ($d$). Red and blue bars represent the percentages of convergence to globally-optimal and locally-optimal (including global) SVM solutions, respectively. Teal bars are complements of the blue bars. Larger overparameterization ($d$) increases the likelihood of global convergence.
Figure 3: Implicit biases of the attention layer and logistic regression.
Figure 4: Rank range of solutions for \ref{eqn:sattnsvm} and \ref{eqn:sattnsvmst}, denoted as $\bm{W}^\textsl{mm}$ and $\bm{W}^\textsl{mm}_{\star}$, solved using optimal tokens $(\texttt{opt}_i)_{i=1}^n$ and setting $m=d$ (the rank constraint is eliminated). Both figures confirm ranks of $\bm{W}^\textsl{mm}$ and $\bm{W}^\textsl{mm}_\star$ are bounded by $\max(n,d)$, validating Lemma \ref{lem:rank}.
Figure 5: Percentage of different convergence types of GD when training cross-attention weights (a): $\bm{W}$ or (b): $({\bm{K}},{\bm{Q}})$ with varying $d$. In both figures, red, blue, and teal bars represent the percentages of Global, Local (including Global), and Not Local convergence, respectively. The green bar corresponds to Assumption \ref{assum:token:supp} where all tokens act as support vectors. Larger overparameterization ($d$) relates to a higher percentage of globally-optimal SVM convergence.
...and 8 more figures

Theorems & Definitions (32)

Definition 1: Token Score and Optimality
Theorem 1
Lemma 1
Lemma 2: Optimal Tokens Minimize Training Loss
Theorem 2
Lemma 3
Lemma 4
Theorem 3
Theorem 4
Definition 2: Support Indices and Locally-Optimal Direction
...and 22 more
