Transformers as Support Vector Machines

Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak

TL;DR

This work formalizes a deep connection between transformer self-attention and hard-margin SVMs by showing that optimizing the attention mechanism corresponds to solving max-margin problems over token pairs. It differentiates the implicit biases of the two common parameterizations: Frobenius-norm bias for W-parameterization and nuclear-norm bias for KQ-factorization, with overparameterization enabling global convergence to the max-margin directions. The authors prove global convergence under favorable initial gradients and via overparameterization, and they establish local convergence results that lead to locally-optimal token selections; they further extend the SVM equivalence to nonlinear prediction heads and sequential/causal prediction settings. Empirical results validate the theoretical predictions and illustrate scenarios where attention selects a single token or multiple tokens, offering a principled, SVM-based lens to interpret transformer optimization and generalization.
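One standard linear-algebra fact helps explain why the KQ-factorization is associated with the nuclear norm: for any matrix $W$, the nuclear norm equals the minimum of $\tfrac{1}{2}(\|K\|_F^2+\|Q\|_F^2)$ over all factorizations $W=KQ^\top$. The NumPy sketch below checks this numerically on a random matrix. It is an illustration only, not the authors' code, and the helper names (`nuclear_norm`, `balanced_factors`, `factor_cost`) are hypothetical.

```python
import numpy as np

def nuclear_norm(W):
    # Sum of singular values of W.
    return np.linalg.svd(W, compute_uv=False).sum()

def balanced_factors(W):
    # Split each singular value evenly between the two factors,
    # so that K @ Q.T reproduces W.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s)
    K = U * root      # scales column j of U by sqrt(s_j)
    Q = Vt.T * root   # scales column j of V by sqrt(s_j)
    return K, Q

def factor_cost(K, Q):
    # Penalty a small weight decay would place on the factors (K, Q).
    return 0.5 * (np.linalg.norm(K, "fro") ** 2 + np.linalg.norm(Q, "fro") ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))

K, Q = balanced_factors(W)

# An unbalanced factorization of the same W costs strictly more.
c = 3.0
K_unbalanced, Q_unbalanced = K * c, Q / c

print("nuclear norm of W      :", nuclear_norm(W))
print("balanced factor cost   :", factor_cost(K, Q))
print("unbalanced factor cost :", factor_cost(K_unbalanced, Q_unbalanced))
print("Frobenius norm of W    :", np.linalg.norm(W, "fro"))
```

The balanced factorization attains the nuclear norm, while rescaling the factors leaves $W$ unchanged but raises the cost, which is the sense in which penalizing $(K,Q)$ acts like a nuclear-norm penalty on $W$.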

Abstract

Since its inception in "Attention Is All You Need", the transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as softmax$(XQK^\top X^\top)$, where $(K,Q)$ are the trainable key-query parameters. In this work, we establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem that separates optimal input tokens from non-optimal tokens using linear constraints on the outer-products of token pairs. This formalism allows us to characterize the implicit bias of 1-layer transformers optimized with gradient descent: (1) Optimizing the attention layer with vanishing regularization, parameterized by $(K,Q)$, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W=KQ^\top$. By contrast, parameterizing directly by $W$ minimizes a Frobenius norm objective. We characterize this convergence, highlighting that it can occur toward locally-optimal directions rather than global ones. (2) Complementing this, we prove the local/global directional convergence of gradient descent under suitable geometric conditions. Importantly, we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points. (3) While our theory applies primarily to linear prediction heads, we propose a more general SVM equivalence that predicts the implicit bias with nonlinear heads. Our findings are applicable to arbitrary datasets and their validity is verified via experiments. We also introduce several open problems and research directions. We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
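To make "converges in direction" concrete, the following NumPy sketch shows how a softmax attention map built from a single combined matrix $W$ sharpens toward hard token selection as the norm of $W$ grows along a fixed direction. The tiny dataset, the choice of direction, and the function name `attention_map` are assumptions made purely for illustration; the sketch writes the similarities as $XWX^\top$ with one combined matrix rather than separate key and query factors.

```python
import numpy as np

def attention_map(X, W):
    # Row-wise softmax of the pairwise similarities X W X^T.
    logits = X @ W @ X.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Three tokens in two dimensions (hypothetical toy data).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])

# A fixed unit-Frobenius-norm direction for the attention weights.
W_dir = np.eye(2) / np.sqrt(2.0)

soft = attention_map(X, 1.0 * W_dir)    # small norm: diffuse attention
hard = attention_map(X, 100.0 * W_dir)  # large norm: near one-hot attention

print("attention at norm 1:\n", soft.round(3))
print("attention at norm 100:\n", hard.round(3))
print("selected token per row:", hard.argmax(axis=1))
```

Only the direction of $W$ determines which token each row selects in the limit; the norm controls how sharply it is selected. This is why the analysis focuses on the limiting direction of the weights rather than their magnitude.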

Paper Structure (43 sections, 27 theorems, 218 equations, 13 figures)

Key Result

Theorem 1

Suppose $d\geq \max(T-1,n)$. Then, almost all datasets $(Y_i,{\bm{X}}_i,{\bm{z}}_i)_{i=1}^n$ -- including the self-attention setting with ${\bm{z}}_i\gets\bm{x}_{i1}$ -- obey the following: \ref{eqn:sattnsvm} is feasible, i.e., $\bm{W}^\textsl{mm}$ separates the desired tokens $\texttt{opt}=(\texttt{opt}_i
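The feasibility claim can be probed numerically. The sketch below assumes the separation constraints take the form $(\bm{x}_{i\texttt{opt}_i}-\bm{x}_{it})^\top \bm{W} {\bm{z}}_i \geq 1$ for all $t\neq \texttt{opt}_i$, which matches the abstract's "linear constraints on the outer-products of token pairs" but is not spelled out in this excerpt. It draws random Gaussian data with $d\geq\max(T-1,n)$, picks arbitrary target tokens, and solves the constraints as equalities by least squares. The sizes, the target indices, and the helper name `svm_constraint_features` are hypothetical.

```python
import numpy as np

def svm_constraint_features(X_list, z_list, opt):
    # One row per (sample i, token t != opt_i): the flattened outer product
    # of (x_{i,opt_i} - x_{i,t}) with z_i, so that row @ vec(W) is the margin.
    rows = []
    for X, z, o in zip(X_list, z_list, opt):
        for t in range(X.shape[0]):
            if t != o:
                rows.append(np.outer(X[o] - X[t], z).ravel())
    return np.array(rows)

rng = np.random.default_rng(0)
n, T, d = 3, 4, 4                    # satisfies d >= max(T - 1, n)
X_list = [rng.standard_normal((T, d)) for _ in range(n)]
z_list = [X[0] for X in X_list]      # self-attention setting: z_i <- x_{i1}
opt = [1, 2, 0]                      # arbitrary target token per sample (assumption)

A = svm_constraint_features(X_list, z_list, opt)

# Solve A vec(W) = 1: every margin equals 1, so all constraints hold.
w, *_ = np.linalg.lstsq(A, np.ones(A.shape[0]), rcond=None)
W = w.reshape(d, d)
margins = A @ w

print("number of constraints:", A.shape[0])
print("rank of constraint matrix:", np.linalg.matrix_rank(A))
print("smallest margin:", margins.min())
```

This only certifies that a separating $\bm{W}$ exists for the sampled data. The least-squares solution forces every margin to equal 1 and is generally not the max-margin solution $\bm{W}^\textsl{mm}$, which would require a constrained solver.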

Figures (13)

  • Figure 1: GD convergence during training of cross-attention weight $\bm{W}$ or $({\bm{K}},{\bm{Q}})$ with data. Teal and yellow markers represent tokens from ${\bm{X}}_1$ and ${\bm{X}}_2$, while stars mark optimal tokens. Solid lines in Figures (a) and (b) depict \ref{eqn:sattnsvm} and \ref{eqn:sattnsvmst} directions mapped to ${\bm{z}}_1$ (red) and ${\bm{z}}_2$ (blue), respectively. Arrows illustrate GD trajectories converging towards these SVM directions. Red and blue dotted lines represent the corresponding separating hyperplanes.
  • Figure 2: Percentage of different convergence types when training cross-attention weights ($\bm{W}$) using GD and varying dimension ($d$). Red and blue bars represent the percentages of convergence to globally-optimal and locally-optimal (including global) SVM solutions, respectively. Teal bars are complements of the blue bars. Larger overparameterization ($d$) increases the likelihood of global convergence.
  • Figure 3: Implicit biases of the attention layer and logistic regression.
  • Figure 4: Rank range of solutions for \ref{eqn:sattnsvm} and \ref{eqn:sattnsvmst}, denoted as $\bm{W}^\textsl{mm}$ and $\bm{W}^\textsl{mm}_{\star}$, solved using optimal tokens $(\texttt{opt}_i)_{i=1}^n$ and setting $m=d$ (the rank constraint is eliminated). Both figures confirm ranks of $\bm{W}^\textsl{mm}$ and $\bm{W}^\textsl{mm}_\star$ are bounded by $\max(n,d)$, validating Lemma \ref{lem:rank}.
  • Figure 5: Percentage of different convergence types of GD when training cross-attention weights (a): $\bm{W}$ or (b): $({\bm{K}},{\bm{Q}})$ with varying $d$. In both figures, red, blue, and teal bars represent the percentages of Global, Local (including Global), and Not Local convergence, respectively. The green bar corresponds to Assumption \ref{assum:token:supp} where all tokens act as support vectors. Larger overparameterization ($d$) relates to a higher percentage of globally-optimal SVM convergence.
  • ...and 8 more figures

Theorems & Definitions (32)

  • Definition 1: Token Score and Optimality
  • Theorem 1
  • Lemma 1
  • Lemma 2: Optimal Tokens Minimize Training Loss
  • Theorem 2
  • Lemma 3
  • Lemma 4
  • Theorem 3
  • Theorem 4
  • Definition 2: Support Indices and Locally-Optimal Direction
  • ...and 22 more