
Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Aditya Varre, Mark Rofin, Nicolas Flammarion

TL;DR

Gradient flow on the value-softmax model $L(\mathbf{V} \sigma(\mathbf{a}))$ inherently drives optimization toward solutions with low-entropy softmax outputs. This polarizing effect holds across various objectives, including logistic and square loss.

Abstract

Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $L(\mathbf{V} \sigma(\mathbf{a}))$, where $\mathbf{V}$ and $\mathbf{a}$ are a learnable value matrix and attention vector, respectively. As the matrix-times-softmax-vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into the training dynamics of transformers. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
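The polarizing effect described in the abstract can be illustrated with a small numerical sketch. This is not the authors' experimental setup: it assumes a single-row value matrix (written `u` here), a single sample with label +1 under the logistic loss $\log(1 + e^{-f})$ with $f = u^\top \sigma(a)$, a zero-initialized attention vector, and an explicit Euler discretization of gradient flow. The function names, the initial values of `u`, the step size, and the number of steps are all hypothetical choices made for illustration.

```python
import math

def softmax(a):
    # Numerically stable softmax over a list of floats.
    m = max(a)
    z = [math.exp(x - m) for x in a]
    total = sum(z)
    return [x / total for x in z]

def entropy(p):
    # Shannon entropy (in nats) of a probability vector.
    return -sum(x * math.log(x) for x in p if x > 0.0)

def simulate(u0, steps=5000, lr=0.1):
    # Assumed toy model: f = u . softmax(a), loss = log(1 + exp(-f)).
    # Explicit Euler steps with a small step size stand in for gradient flow.
    u = list(u0)
    a = [0.0] * len(u)  # assumed initialization: uniform attention
    entropies = []
    for _ in range(steps):
        s = softmax(a)
        f = sum(ui * si for ui, si in zip(u, s))
        g = 1.0 / (1.0 + math.exp(f))  # equals -d(loss)/df, always positive
        entropies.append(entropy(s))
        # d f / d a_i = s_i * (u_i - f) and d f / d u_i = s_i.
        # Both updates use the values of s, u, f from the start of the step.
        a = [ai + lr * g * si * (ui - f) for ai, si, ui in zip(a, s, u)]
        u = [ui + lr * g * si for ui, si in zip(u, s)]
    return u, softmax(a), entropies

if __name__ == "__main__":
    u, s, entropies = simulate([0.5, 0.1, -0.3])
    print("final softmax output:", s)
    print("entropy at start and end:", entropies[0], entropies[-1])
```

In this toy setting the softmax mass drifts toward the coordinate with the largest initial value projection, so the entropy of $\sigma(a)$ at the end of the run is lower than at the uniform start. The sketch only illustrates the qualitative claim and makes no statement about the rates or conditions established in the paper.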

Paper Structure

This paper contains 57 sections, 10 theorems, 96 equations, 21 figures, 3 tables.

Key Result

Theorem 3.2

Consider gradient flow on the logistic loss \ref{eq:logistic-loss-V-a} under the initialization in Assumption \ref{ass:init}. Then, for all $t > 0$,

Figures (21)

  • Figure 1: Representative attention patterns of the 2nd Transformer layer solving an induction task. Sigmoid and linear attentions are trained without additional normalization (see Table \ref{table:attention-types}). An attention sink clearly emerges for softmax attention but not for the other attention types.
  • Figure 2: Experimental verification of Theorem \ref{thm:sparsity-logistic} using logistic loss. The plot shows the evolution of attention scores $\sigma(a)$ and value projections $u$ over time. As predicted, the attention scores converge to a one-hot vector (blue line goes to 1, others to 0) and the value projections $u$ diverge, preserving their order.
  • Figure 3: Evolution of attention logits in a regression problem with condition number $\kappa = 1$ and $\kappa = 5$.
  • Figure 4: Proportion of sink heads at the 2nd layer of Transformer trained for the induction task.
  • Figure 5: Evolution of the attention scores for two differently labeled samples in the classification experiment.
  • ...and 16 more figures

Theorems & Definitions (16)

  • Theorem 3.2
  • Theorem 3.3
  • Lemma 3.4
  • Theorem 2.1
  • proof
  • Theorem 2.1
  • proof
  • Lemma 2.1
  • proof
  • Lemma 3.0
  • ...and 6 more