Online Vector Quantized Attention
Nick Alonso, Tomas Figliolia, Beren Millidge
TL;DR
The paper tackles the trade-off between efficiency and long-context performance in transformer-style sequence processing by introducing Online Vector Quantized Attention (OVQ-attention), an online variant of VQ-attention that learns both key and value dictionaries during the forward pass. Grounded in Gaussian mixture regression, OVQ-attention provides a principled online learning framework with sparse updates and dictionary growth that plateaus, so memory capacity scales without inflating per-update cost. Empirically, OVQ-attention delivers strong long-context performance on synthetic in-context recall and learning tasks and competitive long-context language modeling on PG19, often matching or approaching full self-attention while using only a fraction of its memory. The work also discusses limitations and future directions, including continual learning benefits and hardware-efficient implementations, and positions OVQ-attention as a viable path toward scalable, long-context-capable attention mechanisms.
Abstract
Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long-context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long-context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, its memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long-context tasks and on long-context language modeling. OVQ-attention shows significant improvements over linear attention baselines and over the original VQ-attention, which inspired OVQ-attention. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.
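To make the idea of a sparse memory update concrete, the following toy sketch may help. It is not the method developed in the paper: it assumes a fixed number of dictionary slots, nearest-centroid assignment, and running-mean updates, and the names `OnlineVQMemory`, `write`, and `read` are hypothetical. It only illustrates how each incoming token can modify a single key/value slot, so the memory size stays constant while the per-token cost does not grow with sequence length.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

class OnlineVQMemory:
    """Toy fixed-size key/value dictionary with sparse online updates.

    Illustrative only; not the OVQ-attention algorithm from the paper.
    """

    def __init__(self, num_slots, dim):
        self.keys = np.zeros((num_slots, dim))
        self.values = np.zeros((num_slots, dim))
        self.counts = np.zeros(num_slots)
        self.used = 0  # number of slots filled so far

    def write(self, k, v):
        # Fill empty slots first; afterwards, pick the nearest key centroid.
        if self.used < len(self.keys):
            i = self.used
            self.used += 1
        else:
            i = int(np.argmin(((self.keys - k) ** 2).sum(axis=1)))
        # Sparse update: only slot i changes (running mean of assigned tokens).
        self.counts[i] += 1
        lr = 1.0 / self.counts[i]
        self.keys[i] += lr * (k - self.keys[i])
        self.values[i] += lr * (v - self.values[i])

    def read(self, q):
        # Softmax attention over the filled dictionary slots only.
        K = self.keys[:self.used]
        V = self.values[:self.used]
        logits = K @ q / np.sqrt(q.shape[0])
        return softmax(logits) @ V

rng = np.random.default_rng(0)
mem = OnlineVQMemory(num_slots=4, dim=8)
for _ in range(100):
    mem.write(rng.normal(size=8), rng.normal(size=8))
out = mem.read(rng.normal(size=8))
print(out.shape)  # (8,)
```

In this sketch the state holds 4 slots no matter how many tokens are written, and each `write` touches one row of the key and value dictionaries, whereas a linear-attention state update would modify the whole state at every step.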
