Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Aditya Varre; Mark Rofin; Nicolas Flammarion

TL;DR

Gradient flow on the value-softmax model $L(\mathbf{V} \sigma(\mathbf{a}))$, the matrix-times-softmax-vector building block of self-attention, inherently drives optimization toward solutions with low-entropy softmax outputs. This polarizing effect holds across objectives, including the logistic and square losses.

Abstract

Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $L(\mathbf{V} \sigma(\mathbf{a}))$, where $\mathbf{V}$ and $\mathbf{a}$ are a learnable value matrix and attention vector, respectively. As the matrix-times-softmax-vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into the training dynamics of transformers. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
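The polarizing effect can be illustrated numerically with a minimal sketch. The setup here is an assumption, not the paper's exact model: the value matrix is reduced to a single row $v$ (so the output is the scalar $f = v^\top \sigma(a)$), the loss is the toy logistic loss $\ell(f) = \log(1 + e^{-f})$, and gradient flow is approximated by explicit Euler steps. The names `simulate`, `lr`, and `steps` are hypothetical. Under these assumptions the gradients are $\partial \ell / \partial v_i = -s\,\sigma_i$ and $\partial \ell / \partial a_i = -s\,\sigma_i (v_i - f)$ with $s = 1/(1 + e^{f})$, so attention mass shifts toward coordinates whose value exceeds the current output.

```python
import math


def softmax(a):
    """Numerically stable softmax of a list of logits."""
    m = max(a)
    e = [math.exp(x - m) for x in a]
    z = sum(e)
    return [x / z for x in e]


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def simulate(v, a, lr=0.1, steps=5000):
    """Euler discretization of gradient flow on l(f) = log(1 + exp(-f)),
    where f = sum_i softmax(a)_i * v_i (a toy, single-row value model).

    Returns the final v, the final a, and the entropy of softmax(a)
    recorded before every step and once more after the last step.
    """
    v, a = list(v), list(a)
    history = []
    for _ in range(steps):
        p = softmax(a)
        f = sum(pi * vi for pi, vi in zip(p, v))
        history.append(entropy(p))
        s = 1.0 / (1.0 + math.exp(f))  # s = -dl/df, always positive
        grad_v = [-s * pi for pi in p]
        grad_a = [-s * pi * (vi - f) for pi, vi in zip(p, v)]
        v = [vi - lr * g for vi, g in zip(v, grad_v)]
        a = [ai - lr * g for ai, g in zip(a, grad_a)]
    history.append(entropy(softmax(a)))
    return v, a, history


if __name__ == "__main__":
    # Assumed initialization: uniform attention, distinct values.
    v_final, a_final, hist = simulate([0.5, 0.1, -0.3], [0.0, 0.0, 0.0])
    print("attention:", softmax(a_final))
    print("entropy start/end:", hist[0], hist[-1])
```

Starting from uniform attention, the entropy can only drop once the logits separate, and the coordinate with the largest initial value keeps the largest attention weight. How quickly the weights approach a one-hot vector depends on the step size and horizon chosen here, which are illustrative only.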

Paper Structure (57 sections, 10 theorems, 96 equations, 21 figures, 3 tables)


Introduction
Related Work
Attention sinks and massive activations.
Softmax alternatives and entropy collapse.
Sparse attention patterns.
Polarization and population dynamics.
Problem Setup
Notations.
Self attention.
A value-softmax model.
Gradient flow.
Polarizing effect of softmax
Logistic loss
Intuition from replicator dynamics.
Repulsion between the coordinates.
...and 42 more sections

Key Result

Theorem 3.2

Consider gradient flow on the logistic loss \eqref{eq:logistic-loss-V-a} under the initialization of Assumption \ref{ass:init}. Then, for all $t > 0$,

Figures (21)

Figure 1: Representative attention patterns of the 2nd Transformer layer solving an induction task. Sigmoid and linear attentions are trained without additional normalization (see Table \ref{table:attention-types}). An attention sink clearly emerges for softmax attention but not for the other attention types.
Figure 2: Experimental verification of Theorem \ref{thm:sparsity-logistic} using logistic loss. The plot shows the evolution of the attention scores $\sigma(a)$ and the value projections $u$ over time. As predicted, the attention scores converge to a one-hot vector (the blue line goes to 1, the others to 0), and the value projections $u$ diverge while preserving their order.
Figure 3: Evolution of attention logits in a regression problem with condition numbers $\kappa = 1$ and $\kappa = 5$.
Figure 4: Proportion of sink heads in the 2nd layer of a Transformer trained on the induction task.
Figure 5: Evolution of the attention scores for two differently labeled samples in the classification experiment.
...and 16 more figures

Theorems & Definitions (16)

Theorem 3.2
Theorem 3.3
Lemma 3.4
Theorem 2.1
proof
Theorem 2.1
proof
Lemma 2.1
proof
Lemma 3.0
...and 6 more
