Max-Margin Token Selection in Attention Mechanism

Davoud Ataee Tarzanagh; Yingcong Li; Xuechen Zhang; Samet Oymak

TL;DR

The paper formalizes attention as a max-margin token selection mechanism by analyzing gradient-descent dynamics on attention parameters, showing convergence in direction to a max-margin separator that distinguishes relevant tokens from irrelevant ones. It introduces a contextual dataset model and proves implicit-bias results for both prompt-tuning and self-attention under common losses, including ridge-regularized paths that align with SVM directions under appropriate scaling. The analysis extends to nonlinear heads and multi-context settings, and experiments validate sparsity and focused attention as training progresses. The findings provide a principled explanation for why attention tends to highlight salient tokens and have implications for prompting and architecture design in large transformers.

Abstract

Attention mechanism is a central component of the transformer architecture which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are trainable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearly formalizes attention as an optimal token selection mechanism. Remarkably, our results are applicable to general data and precisely characterize $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and problem geometry. We also provide a broader regularization path analysis that establishes the margin maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
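The model and the token-selection behavior described in the abstract can be illustrated with a small NumPy sketch. Everything concrete here is an assumption chosen for illustration and is not taken from the paper: the toy data (three orthogonal tokens, $\boldsymbol{W}$ set to the identity, a fixed $\boldsymbol{v}$), a single training sequence with label $+1$, the step size, and the helper names `attention`, `grad_p`, and `run_gd`. The reference direction `p_mm` is the closed-form solution of the hard-margin problem "minimize $\|\boldsymbol{p}\|$ subject to $p_0 - p_t \geq 1$ for $t \neq 0$" on this toy instance, which is one reading of the max-margin solution the abstract refers to.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention(X, v, W, p):
    """Return f(X) = <Xv, softmax(XWp)> and the softmax weights."""
    s = softmax(X @ W @ p)
    return (X @ v) @ s, s

def grad_p(X, v, W, p):
    """Gradient of the logistic loss log(1 + exp(-f(X))) with respect to p (label +1)."""
    scores = X @ v                         # per-token value embeddings Xv
    f, s = attention(X, v, W, p)
    dloss_df = -1.0 / (1.0 + np.exp(f))    # derivative of log(1 + exp(-f))
    df_dlogits = s * (scores - f)          # (diag(s) - s s^T) applied to scores
    return dloss_df * ((X @ W).T @ df_dlogits)

def run_gd(X, v, W, steps=500, lr=1.0):
    """Plain gradient descent on p only; v and W stay fixed (illustrative settings)."""
    p = np.zeros(W.shape[1])
    for _ in range(steps):
        p = p - lr * grad_p(X, v, W, p)
    return p

# Toy instance (assumed): token 0 has the highest score under v.
X = np.eye(3)
W = np.eye(3)
v = np.array([1.0, 0.0, -1.0])

p = run_gd(X, v, W)
f, s = attention(X, v, W, p)

# Hard-margin direction for this toy instance (assumed formulation, see lead-in).
p_mm = np.array([2.0, -1.0, -1.0]) / 3.0
cosine = p @ p_mm / (np.linalg.norm(p) * np.linalg.norm(p_mm))

print("softmax weights:", np.round(s, 4))
print("norm of p:", np.linalg.norm(p))
print("cosine with max-margin direction:", cosine)
```

In this sketch the softmax weights concentrate on the highest-scoring token while the norm of $\boldsymbol{p}$ keeps growing, so only the direction of $\boldsymbol{p}$ is meaningful in the limit. Directional alignment with `p_mm` improves slowly with the number of steps, so the printed cosine should be read as a trend rather than a converged value.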

Paper Structure (8 sections, 2 theorems, 10 equations, 4 figures)

Experiments
Introduction
Attention Mechanisms: Self-attention and Prompt-tuning
Contextual Dataset
Maximum Margin Context Separation
Implicit Bias of Attention
Implicit Bias of Self-Attention
Multi-context Dataset

Key Result

Theorem 1

Suppose the training set $\mathcal{S}$ is generated according to CDM. Consider ERM with $f=f^{\textsc{pt}}$, where $\ell(\cdot)$ is either the squared loss or the logistic loss. Fix the classifier head $\bm{w}=\bm{v}_\star$ and only tune the prompt vector ${\bm{q}}$ with iterations ${\bm{q}}_{t+1}\gets {\bm{q}}_t-

Figures (4)

Figure 1: Evolution of softmax probability and attention weights when training with normalized gradient descent or constant step size $\eta$ respectively.
Figure 2: Trajectories of ${\bm{p}}$ with different loss functions and scores in Theorem \ref{conv:gd:global}.
Figure 3: Illustration of the progressive change in attention weights of the [CLS] token during training in the transformer model, using a specific input image shown in Figure \ref{fig:real_image}.
Figure 4: The red curve is the sparsity level $\widehat{\text{nnz}}(\bm{s})/T$ of the average attention map, which takes values in $[0,1]$. A sparser vector implies that a few key tokens receive significantly higher attention, while the majority of tokens receive minimal attention. The blue curve is the Frobenius norm $\| \bm{W} \|_F$ of the final layer's attention weights. Both curves are plotted over training epochs.

Theorems & Definitions (2)

Theorem 1
Theorem 2
