Online Vector Quantized Attention

Nick Alonso, Tomas Figliolia, Beren Millidge

TL;DR

The paper tackles the trade-off between efficiency and long-context performance in transformer-style sequence processing by introducing Online Vector Quantized Attention (OVQ-attention), an online variant of VQ-attention that learns both key and value dictionaries during forward passes. Grounded in Gaussian mixture regression, OVQ-attention provides a principled online learning framework with sparse updates and dictionary growth that plateaus, which scales memory capacity without inflating per-update cost. Empirically, OVQ-attention delivers strong long-context performance on synthetic in-context recall and in-context learning tasks and competitive long-context language modeling on PG19, often matching or approaching full self-attention while using only a fraction of its memory. The work discusses limitations and future directions, including continual learning benefits and hardware-efficient implementations, suggesting OVQ-attention as a viable path toward scalable, long-context-capable attention mechanisms.

Abstract

Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long-context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long-context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long-context tasks and on long-context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, which inspired it. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.
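To make the idea of a sparse memory update concrete, the following is a minimal toy sketch in NumPy. It is not the paper's algorithm: the class name `OnlineVQMemory`, the distance threshold `new_centroid_dist`, the running-mean update rule, and the count-based log bias in the read step are all assumptions chosen for illustration. The sketch only shows the general pattern the abstract describes: a dictionary of key and value centroids that grows up to a fixed size, where each incoming token touches a single dictionary row.

```python
import numpy as np


class OnlineVQMemory:
    """Toy online key/value dictionary (hypothetical; not the paper's method)."""

    def __init__(self, dim, max_centroids, new_centroid_dist=1.0):
        assert max_centroids >= 1
        self.dim = dim
        self.max_centroids = max_centroids          # cap on dictionary size
        self.new_centroid_dist = new_centroid_dist  # assumed growth threshold
        self.keys = np.zeros((0, dim))
        self.values = np.zeros((0, dim))
        self.counts = np.zeros(0)

    def update(self, k, v):
        """Assign (k, v) to one centroid, or add a new one while there is room."""
        if len(self.keys) > 0:
            dists = np.linalg.norm(self.keys - k, axis=1)
            j = int(np.argmin(dists))
            nearest = dists[j]
        else:
            j, nearest = -1, np.inf
        has_room = len(self.keys) < self.max_centroids
        if has_room and nearest > self.new_centroid_dist:
            # Growth phase: store the pair as a new centroid.
            self.keys = np.vstack([self.keys, k])
            self.values = np.vstack([self.values, v])
            self.counts = np.append(self.counts, 1.0)
            return
        # Sparse update: only row j changes (running mean of assigned pairs).
        self.counts[j] += 1.0
        lr = 1.0 / self.counts[j]
        self.keys[j] += lr * (k - self.keys[j])
        self.values[j] += lr * (v - self.values[j])

    def read(self, q):
        """Softmax attention over centroids, biased by log assignment counts."""
        if len(self.keys) == 0:
            return np.zeros(self.dim)
        logits = self.keys @ q / np.sqrt(self.dim) + np.log(self.counts)
        logits -= logits.max()  # numerical stability
        weights = np.exp(logits)
        weights /= weights.sum()
        return weights @ self.values


rng = np.random.default_rng(0)
mem = OnlineVQMemory(dim=8, max_centroids=16)
for _ in range(1000):
    mem.update(rng.normal(size=8), rng.normal(size=8))
out = mem.read(rng.normal(size=8))
```

In this sketch the memory state stops growing once `max_centroids` rows exist, so its size is constant in sequence length, and each `update` writes to one row regardless of how large the dictionary is. A dense linear-attention state update, by contrast, modifies the whole state on every token.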

Paper Structure

This paper contains 30 sections, 31 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Preliminary test of in-context recall in a model interleaving sliding window and VQ-attention layers. Different numbers of centroids, $\texttt{N}$, are tested.
  • Figure 2: Process for generating OVQ-attention output for token at position $T$.
  • Figure 3: State updates in linear and OVQ-attention models.
  • Figure 4: In-context recall. The two left plots show per-token accuracy for our two synthetic recall tasks up to 64k context length. The right plot shows how the memory state, i.e., the KV cache, grows with context length.
  • Figure 5: In-context learning. Models are trained on a 2k context with 16 functions. We show the per-token accuracy over the output, $\mathbf{y}_n$, for each example, $n$, in the context, averaged over the test set.
  • ...and 7 more figures