Softmax is not Enough (for Sharp Size Generalisation)
Petar Veličković, Christos Perivolaropoulos, Federico Barbero, Razvan Pascanu
TL;DR
This work shows that softmax-based attention cannot sustain sharp computations as input size grows: it proves that attention coefficients must disperse towards a uniform distribution as the number of tokens increases. It introduces adaptive temperature, an entropy-guided inference-time mechanism that adjusts the softmax temperature to sharpen selections, and keeps the mechanism efficient by computing entropy in a streaming fashion. Empirical results on Max Retrieval and CLRS-Text show improved sharpness and accuracy on larger inputs, while some limitations remain. The findings motivate exploring non-softmax or hybrid attention mechanisms to achieve robust long-input generalisation in reasoning systems.
Abstract
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
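The dispersion effect described above can be illustrated with a small numerical sketch. It is not the paper's experimental setup: the logit layout (one maximum item at logit `gap`, all other items at logit 0), the value `gap=5.0`, and the helper names `softmax` and `max_item_weight` are assumptions chosen for illustration. With logits bounded in this way, the weight on the maximum item equals e^gap / (e^gap + n - 1), which shrinks towards zero as the number of items n grows. The last column uses a fixed lower temperature to show that rescaling the logits restores sharpness; it is a stand-in for the idea behind temperature adjustment, not the entropy-guided adaptive scheme itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_item_weight(n, gap=5.0, temperature=1.0):
    """Softmax weight on the single maximum item among n items.

    Illustrative assumption: one item has logit `gap`, the other n - 1
    items have logit 0, so the logits stay bounded as n grows.
    """
    logits = [gap] + [0.0] * (n - 1)
    return softmax(logits, temperature)[0]


if __name__ == "__main__":
    print("n, weight at T=1.0, weight at T=0.25")
    for n in (16, 256, 4096, 65536):
        w_default = max_item_weight(n)
        w_sharp = max_item_weight(n, temperature=0.25)
        print(n, round(w_default, 4), round(w_sharp, 4))
```

In this toy setting the weight on the maximum item at temperature 1 falls from above 0.9 at 16 items to below 0.01 at 65536 items, even though the logit gap never changes. A fixed temperature of 0.25 keeps the weight close to 1 over this range, but any fixed temperature with bounded logits eventually disperses for large enough n.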
