Support Tokens, Stability Margins, and a New Foundation for Robust LLMs

Deepak Agarwal, Dhyey Dharmendrakumar Mavani, Suyash Gupta, Karthik Sethuraman, Tejas Dharamsi

TL;DR

LLMs can be interpreted as a stochastic process over the power set of the token space, which gives a rigorous probabilistic framework for sequence modeling. Building on this, a Bayesian framework is derived that yields more robust models without sacrificing out-of-sample accuracy.

Abstract

Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We re-interpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much like how classical PCA is extended to probabilistic PCA. However, this re-formulation reveals a surprising and deeper structural insight: due to a change-of-variables phenomenon, a barrier constraint emerges on the self-attention parameters. This induces a highly structured geometry on the token space, providing theoretical insights into the dynamics of LLM decoding. This reveals a boundary where attention becomes ill-conditioned, leading to a margin interpretation similar to classical support vector machines. Just like support vectors, this naturally gives rise to the concept of support tokens. Furthermore, we show that LLMs can be interpreted as a stochastic process over the power set of the token space, providing a rigorous probabilistic framework for sequence modeling. We propose a Bayesian framework and derive a MAP estimation objective that requires only a minimal modification to standard LLM training: the addition of a smooth log-barrier penalty to the usual cross-entropy loss. We demonstrate that this provides more robust models without sacrificing out-of-sample accuracy and that it is straightforward to incorporate in practice.
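The training change described in the abstract, cross-entropy plus a smooth log-barrier penalty, can be sketched as follows. This is a minimal illustration and not the paper's exact objective: the function names, the per-token `factors` input (positive quantities that shrink toward 0 as a token approaches the degeneracy boundary), and the weight `lam` are assumptions introduced here.

```python
import math

def log_barrier(factors):
    """Hypothetical smooth log-barrier: -sum(log f) over per-token factors.

    Returns +inf when any factor is non-positive, i.e. when the
    configuration lies on or beyond the degeneracy boundary.
    """
    if any(f <= 0.0 for f in factors):
        return math.inf
    return -sum(math.log(f) for f in factors)

def penalized_loss(cross_entropy, factors, lam):
    """Assumed MAP-style objective: cross-entropy plus lam times the barrier."""
    return cross_entropy + lam * log_barrier(factors)

# Illustrative numbers only: a cross-entropy of 2.0 and two per-token factors.
loss = penalized_loss(2.0, [0.75, 0.9], lam=0.05)
print(loss)
```

The barrier is small when every factor is close to 1 and grows without bound as any factor approaches 0, so the token closest to the boundary dominates the penalty.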

Paper Structure (70 sections, 20 theorems, 137 equations, 6 figures, 1 table)

Key Result

Proposition 1

Assume $d=1$ and bilinear logits $q_t^\top k_s$ with $q_t=W_Q x_t$ and $k_s=W_K x_s$. Then for each $t$,

Figures (6)

  • Figure 1.1: Margin-to-degeneracy geometry (SVM-style analogy). The curve denotes an instability boundary: configurations where the attention-induced latent-noise map becomes locally ill-conditioned (small perturbations can lead to ambiguous or disproportionate changes). A sequence traces a trajectory of embedding configurations across token positions (dots). The margin of a token is its distance to this boundary (arrow). In this schematic, token 1 is the support token because it has the smallest margin (it lies closest to instability), so it governs the sequence-level stability margin and dominates the barrier pressure. This mirrors SVMs, where support vectors are the points closest to the decision boundary and therefore determine the margin. (Precise definitions appear in Section \ref{sec:single_layer_margin}.)
  • Figure 3.2: Prior density under positive and negative coupling ($d{=}1$, $n{=}3$). (a) Conditional density of $x_3$ given $x_1{=}0$, $x_2{=}2$. Positive coupling ($a{=}+0.25$, red) lowers and broadens the peak (from $\approx 0.38$ for the Gaussian baseline to $\approx 0.32$) because the change-of-variables factor $1-a\,\mathrm{Var}_a<1$ reduces log-density at nonzero dispersion. Negative coupling ($a{=}-0.25$, green) sharpens the density and raises the peak to $\approx 0.52$ because $1-a\,\mathrm{Var}_a>1$ amplifies the same region. (b) The diagonal factor $1-a\,\mathrm{Var}_a$ (Eq. \ref{eq:scalar_diag_deriv}) as a function of $x_3$: for $a>0$ it decreases toward the degeneracy boundary ($\approx 0.75$ here, reaching $0$ at $\mathrm{Var}_a=1/a$), while for $a<0$ it increases above $1$ (to $\approx 1.25$ here) and remains strictly nondegenerate. (c) Density profiles for $a \in \{-0.55,-0.35,0,+0.35,+0.55\}$: larger positive coupling progressively flattens the density, whereas larger negative coupling progressively sharpens it.
  • Figure 3.3: Positive vs. negative coupling regimes ($d{=}1$, $n{=}5$). 4000 random sequences from $\mathcal{N}(0,4)$ evaluated under positive coupling ($a{=}+0.2$, top row) and negative coupling ($a{=}-0.2$, bottom row). (a) For $a>0$, the diagonal factor follows $1-a\,\mathrm{Var}_a$ exactly, decreases linearly, and reaches $0$ near $\mathrm{Var}_a\approx 5$; only 3387/4000 sequences remain valid. (b) For $a>0$, log-density decreases with $\mathrm{Var}_a$, and configurations near degeneracy are hard-excluded (red points, log-density $=-\infty$). (c) For $a<0$, the factor follows $1+|a|\,\mathrm{Var}_a$, increases linearly, and stays strictly above $1$; all 4000/4000 sequences remain valid, and $\mathrm{Var}_a$ extends to $\approx 16$. (d) For $a<0$, log-density still decreases overall because the Gaussian residual term dominates, but no sequence is excluded and the change-of-variables term partially offsets that decay (about $+\log(4.2)\approx +1.4$ nats at $\mathrm{Var}_a=16$). Together, the panels show that positive coupling induces a true degeneracy barrier, whereas negative coupling removes the barrier and instead promotes larger dispersion.
  • Figure 7.4: Experiment 2: WikiText-2 training curves. Bits-per-character (BPC) on WikiText-2 over 20 epochs for CE-only (blue) and Margin-only (green, $\lambda_m{=}0.05$). (a) Training BPC: both modes converge along nearly identical trajectories, with the margin-regularized model tracking the baseline closely throughout. (b) Validation BPC: the final gap is 0.036 BPC (CE-only: 2.122, Margin-only: 2.158), a relative difference of 1.7%. The margin regularizer preserves predictive quality while imposing the theory-derived dispersion penalty as an additional training signal.
  • Figure 7.5: Experiment 7: BPC under embedding noise on WikiText-2. (a) Absolute BPC as a function of the standard deviation $\sigma$ of Gaussian noise added to embeddings. Margin-only (green) remains consistently below CE-only (blue) across the entire noise range, with the gap widening at higher noise levels. (b) Relative degradation (noisy BPC / clean BPC): CE-only degrades to $2.58{\times}$ its clean performance at $\sigma{=}0.5$, while Margin-only degrades to only $2.46{\times}$, a 12 percentage-point improvement in robustness. The margin regularizer provides measurable protection against embedding perturbations, consistent with the stability-margin interpretation.
  • ...and 1 more figure
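The diagonal factor $1-a\,\mathrm{Var}_a$ quoted in the captions for Figures 3.2 and 3.3 can be checked numerically with a short sketch. It rests on assumptions not stated in this excerpt: that $a$ is the scalar product $W_Q W_K$, so the logits are $a\,x_t x_s$; that token $t$ attends only to strictly earlier tokens; that the latent residual is $x_t$ minus the attention-weighted mean of the past; and that $\mathrm{Var}_a$ is the attention-weighted variance of the past tokens. All function names are hypothetical.

```python
import math

def attention_weights(x_t, past, a):
    # Softmax over past tokens with assumed scalar logits a * x_t * x_s.
    logits = [a * x_t * x_s for x_s in past]
    top = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def attention_variance(x_t, past, a):
    # Attention-weighted variance of the past tokens (Var_a in the captions).
    w = attention_weights(x_t, past, a)
    mean = sum(wi * xs for wi, xs in zip(w, past))
    return sum(wi * (xs - mean) ** 2 for wi, xs in zip(w, past))

def diagonal_factor(x_t, past, a):
    # Change-of-variables factor 1 - a * Var_a.
    return 1.0 - a * attention_variance(x_t, past, a)

def residual(x_t, past, a):
    # Assumed latent residual: token minus attention-weighted mean of its past.
    w = attention_weights(x_t, past, a)
    return x_t - sum(wi * xs for wi, xs in zip(w, past))

def finite_difference(x_t, past, a, h=1e-5):
    # Central-difference estimate of d residual / d x_t.
    return (residual(x_t + h, past, a) - residual(x_t - h, past, a)) / (2 * h)

past = [0.0, 2.0]                           # x_1 = 0, x_2 = 2 as in Figure 3.2
print(diagonal_factor(0.0, past, 0.25))     # 0.75
print(diagonal_factor(0.0, past, -0.25))    # 1.25
gap = abs(finite_difference(0.7, past, 0.25) - diagonal_factor(0.7, past, 0.25))
print(gap < 1e-6)                           # True
```

At $x_3=0$ the two past tokens receive equal weight, so $\mathrm{Var}_a=1$ and the factor equals $1 \mp 0.25$, matching the $\approx 0.75$ and $\approx 1.25$ values in the Figure 3.2 caption. The finite-difference check confirms that, under the stated assumptions, the factor is the derivative of the residual with respect to the current token.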

Theorems & Definitions (21)

  • Proposition 1: Scalar diagonal derivative
  • Definition 3.1: Support tokens
  • Theorem 1: Diagonal Jacobian block and conditioning
  • Corollary 3.1: Log-likelihood and the context-geometry term
  • Theorem 2: Kolmogorov consistency
  • Proposition 2: Non-causal attention breaks consistency
  • Theorem 3: Transformer stochastic process
  • Proposition 3: No layerwise stability correction under previous-layer conditioning
  • Corollary 6.1: Localization to a single attention-prior stage
  • Lemma B.1: Softmax Jacobian
  • ...and 11 more