Softmax Attention with Constant Cost per Token
Franz A. Heinsen
TL;DR
This work tackles the quadratic-time cost of standard Transformer attention by proposing a linearized attention mechanism based on exponential kernel feature maps. The core idea is to express attention in log-space as a composition of log-sums of exponentials over a fixed-size latent representation, which gives constant time and space cost per token. A formal proof connects log-sum-exp identities to the modified attention, and implementations for autoregressive and non-autoregressive settings demonstrate practical viability, including a 125M-parameter language model trained on 300B tokens that reaches a competitive cross-entropy loss of $2.47$. The results are encouraging, but the author stresses that evaluation on larger models and more diverse tasks is needed to establish the method's generality. Overall, the paper presents a potentially efficient alternative to Softmax attention, with a mathematical foundation and attention to practical deployment.
Abstract
We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Our modification linearizes attention with exponential kernel feature maps, whose corresponding feature function is infinite dimensional. We show that our modification is expressible as a composition of log-sums of exponentials, with a latent space of constant size, enabling application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
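The linearization described in the abstract can be illustrated with a minimal sketch; this is not the author's implementation. It assumes a single head, a causal (autoregressive) setting, and a scaling constant that is the same for every query-key pair and therefore cancels when the Softmax normalizes. Under those assumptions, the Softmax of $\log(\exp(q_i) \cdot \exp(k_j))$ is simply $\exp(q_i) \cdot \exp(k_j)$ divided by its sum over $j$, so the output for token $t$ depends only on two running sums whose sizes do not grow with sequence length. The function and variable names are hypothetical, and the sketch works directly with exponentials rather than in log-space, so it may overflow for large inputs.

```python
import math
import random


def pairwise_attention(Q, K, V):
    """Reference version: cost grows with the number of previous tokens."""
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        # Unnormalized weights exp(q_i) . exp(k_j) for all j <= i.
        w = [sum(math.exp(a) * math.exp(b) for a, b in zip(q, K[j]))
             for j in range(i + 1)]
        total = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(i + 1)) / total
                    for c in range(d_v)])
    return out


def constant_state_attention(Q, K, V):
    """Same result, using a fixed-size state updated once per token."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of exp(k_j) v_j^T
    z = [0.0] * d_k                        # running sum of exp(k_j)
    out = []
    for q, k, v in zip(Q, K, V):
        phi_k = [math.exp(x) for x in k]
        for a in range(d_k):
            z[a] += phi_k[a]
            for c in range(d_v):
                S[a][c] += phi_k[a] * v[c]
        phi_q = [math.exp(x) for x in q]
        denom = sum(phi_q[a] * z[a] for a in range(d_k))
        out.append([sum(phi_q[a] * S[a][c] for a in range(d_k)) / denom
                    for c in range(d_v)])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
Q = random_matrix(6, 4, rng)
K = random_matrix(6, 4, rng)
V = random_matrix(6, 3, rng)

slow = pairwise_attention(Q, K, V)
fast = constant_state_attention(Q, K, V)
# Largest elementwise difference between the two versions.
print(max(abs(x - y) for rs, rf in zip(slow, fast) for x, y in zip(rs, rf)))
```

The state `S` and `z` together hold `d_k * d_v + d_k` numbers regardless of how many tokens have been processed, which is the sense in which the cost per token is constant. Carrying these sums as logarithms and combining them with log-sum-exp operations, as the abstract describes, is a way to keep the same computation numerically stable.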
