
Softmax Attention with Constant Cost per Token

Franz A. Heinsen

TL;DR

This work addresses the cost of standard Transformer attention, which grows quadratically with sequence length, by proposing a linearized attention mechanism based on exponential kernel feature maps. The core idea expresses attention in log-space as a composition of log-sums of exponentials over a fixed-size latent representation, which gives constant time and space cost per token. A formal proof uses log-sum-exp identities to show that this composition computes the modified attention, and implementations for autoregressive and non-autoregressive settings demonstrate practical viability, including a 125M-parameter language model trained on 300B tokens that reaches a competitive cross-entropy of $2.47$. While the results are encouraging, the author stresses that broader evaluation on larger models and more diverse tasks is needed to establish the method's generality and impact. Overall, the paper offers a potentially efficient alternative to softmax attention, backed by a mathematical derivation and a working implementation.

Abstract

We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Our modification linearizes attention with exponential kernel feature maps, whose corresponding feature function is infinite dimensional. We show that our modification is expressible as a composition of log-sums of exponentials, with a latent space of constant size, enabling application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
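The composition of log-sums of exponentials described above can be illustrated with a minimal NumPy sketch. It is not the author's implementation. The function names (`quadratic_reference`, `log_space_attention`, `streaming_attention`) and the latent names `Z` and `S` are hypothetical. The sketch assumes every entry of `V` is strictly positive so that `log V` is real, since this excerpt does not say how negative values are handled. It also omits the constant scaling of the dot-products, which cancels in the normalization.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along one axis.
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def quadratic_reference(Q, K, V, causal=False):
    # Direct form: weights are dot-products of exponentials, normalized per query.
    W = np.exp(Q) @ np.exp(K).T                      # (n_q, n_k)
    if causal:
        W = W * np.tril(np.ones_like(W))             # token i sees tokens <= i
    return (W @ V) / W.sum(axis=1, keepdims=True)

def log_space_attention(Q, K, V):
    # Non-autoregressive form. Assumes V > 0 elementwise (sketch assumption).
    log_V = np.log(V)
    # Fixed-size latents: their shapes do not depend on the number of tokens.
    Z = logsumexp(K[:, :, None] + log_V[:, None, :], axis=0)    # (d_k, d_v)
    S = logsumexp(K, axis=0)                                    # (d_k,)
    log_num = logsumexp(Q[:, :, None] + Z[None, :, :], axis=1)  # (n_q, d_v)
    log_den = logsumexp(Q + S[None, :], axis=1)                 # (n_q,)
    return np.exp(log_num - log_den[:, None])

def streaming_attention(Q, K, V):
    # Autoregressive form: each token updates Z and S in place,
    # so the work and memory per token do not grow with position.
    d_k, d_v = K.shape[1], V.shape[1]
    Z = np.full((d_k, d_v), -np.inf)
    S = np.full(d_k, -np.inf)
    outputs = []
    for q, k, v in zip(Q, K, V):
        Z = np.logaddexp(Z, k[:, None] + np.log(v)[None, :])
        S = np.logaddexp(S, k)
        log_num = logsumexp(q[:, None] + Z, axis=0)  # (d_v,)
        log_den = logsumexp(q + S, axis=0)           # scalar
        outputs.append(np.exp(log_num - log_den))
    return np.stack(outputs)
```

On small random inputs, the log-space and streaming functions should agree with the quadratic reference (non-causal and causal respectively) up to floating-point error. Only the reference ever materializes the full query-key weight matrix.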
