Softmax is not Enough (for Sharp Size Generalisation)

Petar Veličković; Christos Perivolaropoulos; Federico Barbero; Razvan Pascanu

TL;DR

This work shows that softmax-based attention cannot sustain sharp computations as input size grows: with bounded logits and a non-zero temperature, attention coefficients must disperse towards uniform as the number of tokens increases. It introduces adaptive temperature, an entropy-guided inference-time mechanism that adjusts the softmax temperature to sharpen selections, and keeps it efficient via streaming entropy computation. Empirical results on Max Retrieval and CLRS-Text show improved sharpness and accuracy on larger inputs, while some limitations remain. The findings motivate exploring non-softmax or hybrid attention mechanisms for robust long-input generalisation in reasoning systems.

Abstract

A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
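The dispersion claim can be checked numerically with a toy version of max retrieval. The sketch below is not the paper's experimental setup: it assumes a hypothetical best case in which the maximum item gets the largest allowed logit (`high`) and every other item gets the smallest (`low`), and it uses only the Python standard library. The function names are illustrative.

```python
import math

def softmax(logits, theta=1.0):
    """Numerically stable softmax with temperature theta > 0."""
    top = max(logits)
    exps = [math.exp((x - top) / theta) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_item_weight(n, low=0.0, high=1.0, theta=1.0):
    """Attention weight placed on the single maximum item among n items.

    Assumed setup: one item has logit `high`, the other n - 1 have `low`,
    which is the most favourable configuration when logits lie in [low, high].
    """
    logits = [low] * (n - 1) + [high]
    return softmax(logits, theta)[-1]

if __name__ == "__main__":
    # The weight on the maximum shrinks as n grows, even in this best case.
    for n in (16, 256, 4096, 65536):
        print(n, max_item_weight(n, low=0.0, high=5.0))
```

Even with a logit gap of 5, the weight on the maximum item is close to 1 for 16 items and falls below 1% for 65536 items, because the denominator grows linearly in the number of items while the numerator stays fixed.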

Paper Structure (22 sections, 5 theorems, 37 equations, 9 figures, 1 table)

Motivation
Key Theoretical Result
Background
Primer on Attentional Heads and Transformers
Dispersion in softmax and Transformers
Practical Values of $\delta$
Adaptive Temperature
Experimental Results
Max Retrieval
CLRS-Text
Conclusions
Experimental Details for the Maximum Entry Retrieval Task
Motivation
Data Generation
Neural Network Architecture
...and 7 more sections

Key Result

Lemma 2.1

Let $\mathbf{e}^{(n)}\in\mathbb{R}^{n}$ be a collection of $n$ logits going into the $\mathtt{softmax}_\theta$ function with temperature $\theta > 0$, bounded above and below s.t. $m\leq e^{(n)}_k\leq M$ for some choice of constants $m, M\in\mathbb{R}$. Then, as more items are added ($n\rightarrow +
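A short sketch of the standard sandwich argument for bounded logits may help here. It assumes the usual definition $\mathtt{softmax}_\theta(\mathbf{e}^{(n)})_k = \exp(e^{(n)}_k/\theta) / \sum_{j=1}^{n}\exp(e^{(n)}_j/\theta)$ and uses only the bounds $m\leq e^{(n)}_j\leq M$ from the lemma's hypothesis; it is an illustration, not a reproduction of the paper's proof.

$$
\exp\!\left(\frac{m}{\theta}\right) \leq \exp\!\left(\frac{e^{(n)}_k}{\theta}\right) \leq \exp\!\left(\frac{M}{\theta}\right),
\qquad
n\exp\!\left(\frac{m}{\theta}\right) \leq \sum_{j=1}^{n}\exp\!\left(\frac{e^{(n)}_j}{\theta}\right) \leq n\exp\!\left(\frac{M}{\theta}\right)
$$

$$
\Longrightarrow\quad
\frac{1}{n}\exp\!\left(\frac{m-M}{\theta}\right) \leq \mathtt{softmax}_\theta(\mathbf{e}^{(n)})_k \leq \frac{1}{n}\exp\!\left(\frac{M-m}{\theta}\right)
$$

For fixed $m$, $M$ and $\theta > 0$, both bounds scale as $1/n$, so no single item can retain a constant share of the attention as $n$ grows.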

Figures (9)

Figure 1: Illustration of Theorem 2.2, one of our key results. Assuming a tokenised input from a fixed vocabulary and a non-zero temperature, for every softmax attention head inside an architecture comprising only MLPs and softmax self-attention layers, it must hold that, given sufficiently many tokens, its attention coefficients will disperse, even if they were sharp in-distribution.
Figure 2: Visualising the attentional head for the max retrieval task for a batch of $32$ randomly-sampled input sets (each represented by one of the rows), over the $16$ items with largest key (columns). If the head operates correctly, it must allocate sharp attention to the rightmost item. From left to right, in each frame we double the number of items the head has to process (starting from $16$ items).
Figure 3: Entropy of attention heads in the first block of Gemma 2B with prompt "What is the maximum in the following sequence: {seq}? The maximum is:" and varying the number of elements in seq. Each curve is one attentional head; the blue shaded curve is the mean and standard deviation across all of them.
Figure 4: Our implementation of adaptive temperature in JAX.
Figure 5: Entropy of $\mathtt{softmax}_\theta(\mathbf{a})$ for 10 elements of a power series $a_i = \lambda^i$, split into four regions depending on range of $(\lambda, \theta)$. Degenerate cases: near $\lambda = 0$ and $\lambda = 1$ (all logits equal).
...and 4 more figures

Theorems & Definitions (13)

Lemma 2.1: softmax must disperse
proof
Theorem 2.2: softmax in Transformers over vocabularies must disperse
proof
Proposition 3.1: Sharpness in Transformers necessitates large weights
Proposition 3.2: Decreasing temperature decreases entropy
Corollary 2.1: Dispersion induces reasoning failures
proof
Remark 2.2
Remark 2.3
...and 3 more
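
Proposition 3.2 in the list above (decreasing temperature decreases entropy) is what makes an entropy-guided temperature choice possible. The sketch below illustrates that idea with a plain bisection search. It is not the paper's JAX implementation from Figure 4: the helper `sharpen`, the `target_entropy` parameter and the `theta_min` floor are all assumptions made for this example, and only the Python standard library is used.

```python
import math

def softmax(logits, theta=1.0):
    """Numerically stable softmax with temperature theta > 0."""
    top = max(logits)
    exps = [math.exp((x - top) / theta) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sharpen(logits, target_entropy, theta_min=1e-3, steps=50):
    """Hypothetical helper: choose theta in [theta_min, 1] by bisection.

    Returns 1.0 if the distribution is already sharp enough. Otherwise it
    searches for the largest temperature whose entropy does not exceed
    target_entropy, relying on entropy being non-decreasing in theta.
    Falls back to theta_min if the target is unreachable (e.g. tied maxima).
    """
    if entropy(softmax(logits, 1.0)) <= target_entropy:
        return 1.0
    lo, hi = theta_min, 1.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(logits, mid)) > target_entropy:
            hi = mid  # still too diffuse: lower the temperature
        else:
            lo = mid  # sharp enough: try a higher temperature
    return lo

if __name__ == "__main__":
    logits = [0.0] * 1023 + [5.0]
    theta = sharpen(logits, target_entropy=0.5)
    print(theta, softmax(logits, 1.0)[-1], softmax(logits, theta)[-1])
```

Lowering the temperature only rescales the existing logits, so it can restore sharpness when the correct item already has the largest logit, but it cannot fix a head that ranks the items incorrectly.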
