Transformer-VQ: Linear-Time Transformers via Vector Quantization

Lucas D. Lingle

TL;DR

Transformer-VQ is a decoder-only transformer that computes softmax-based dense self-attention in linear time by vector-quantizing the keys and caching compressed statistics of past context. The efficient computation is equivalent to dense attention over the quantized keys, and it supports efficient training and sampling. Large-scale experiments show strong results on Enwik8, PG-19, and ImageNet64. Key contributions include a quadratic-time formulation, a linear-time recurrence for decoder attention, a VQ-based training objective with EMA-updated codebooks, and ablations on the effect of codebook size and the compressive cache. The optimized implementation is over 3x faster than a comparable quadratic-time transformer at sequence length 8k and over 12x faster at 32k, and it scales to 131k with similar throughput.

Abstract

We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In our large-scale experiments, Transformer-VQ is shown to be highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. Code is available at https://github.com/transformer-vq/transformer_vq
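
To make the idea of vector-quantized keys concrete, the following minimal Python sketch maps each key to its nearest codeword under squared Euclidean distance, which is the standard nearest-codeword rule for vector quantization. The helper names (`squared_distance`, `quantize`, `dot`) and the toy codebook, keys, and query are assumptions made for illustration and are not taken from the released code.

```python
# A sketch of nearest-codeword vector quantization (illustrative names and data).

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def quantize(keys, codebook):
    """Return the nearest-codeword index for each key, and the quantized keys."""
    codes = [
        min(range(len(codebook)), key=lambda s: squared_distance(k, codebook[s]))
        for k in keys
    ]
    quantized = [codebook[s] for s in codes]
    return codes, quantized


# Hypothetical toy data: S = 3 codewords and T = 4 keys of dimension D = 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
keys = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.9], [1.1, 0.7]]
query = [0.5, -1.0]

codes, quantized_keys = quantize(keys, codebook)

# Attention scores against the quantized keys: keys that share a code
# necessarily receive the same score for a given query.
scores = [dot(query, k_hat) for k_hat in quantized_keys]
```

In this toy example the second and fourth keys fall on the same codeword, so their scores against any query coincide. That redundancy is what a linear-time attention computation can exploit.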

Paper Structure (47 sections, 7 theorems, 17 equations, 11 figures, 11 tables)

Introduction
Preliminaries
Notation
Vector Quantization
Vector Quantizers and Codebooks
Vector-Quantized Representation Learning
Transformer-VQ
Quadratic-Time Formulation
Warmup: Linear-Time Encoder Attention
Linear-Time Decoder Attention
Learning Algorithm
Training Loss
Training Updates
Related Work
Hierarchical Attention
...and 32 more sections

Key Result

Theorem 2.2

Let $\mathbf{q} \in \mathbb{R}^{D}$ be a random variable with $\mathbb{E}_{\mathbf{q}}[\mathbf{q}\mathbf{q}^{\top}] = \sigma^{2} \mathbf{I}_{D}$ for some $\sigma > 0$, and let $\mathbf{k} \in \mathbb{R}^{D}$ be a random variable independent of $\mathbf{q}$. Let $\varphi : \mathbb{R}^{D} \to \dots$

Figures (11)

Figure 1: Schematic of the VQ-Attention approximation. The colorful and blank boxes depict the keys and attention weights, respectively. The keys on the right have been vector-quantized. Since the green keys ${k}_{2}, {k}_{5}$ map to the same code, they have the same attention weights in this attention head.
Figure 2: Schematic of the VQ-Attention factorization with element-wise $\phi_{w}$. The column set of $\mathbf{W} = \phi_{w}(\mathbf{Q}\hat{\mathbf{K}}^{\top}) \in \mathbb{R}^{T \times T}$ has size $\leq S$ due to VQ, so the attention output $\mathbf{O} = \mathbf{W}\mathbf{V}$ can be obtained by computing the unique attention scores $\phi_{w}(\mathbf{Q}\mathbf{C}^{\top})$ and applying them to the grouped sums $\mathbf{\Delta}\mathbf{V}$. Transformer-VQ uses a softmax-based extension of this idea for its cache.
Figure 3: Generated samples from our large ImageNet64 model; nucleus 1.0.
Figure 4: Sample excerpt from our PG-19 model, generated with nucleus 0.8.
Figure 5: Generated samples from our large ImageNet64 model; nucleus 0.999.
...and 6 more figures
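
The factorization described in the Figure 2 caption can be checked numerically. The sketch below is a minimal, non-causal illustration in plain Python. It assumes that $\mathbf{\Delta} \in \mathbb{R}^{S \times T}$ is the one-hot assignment matrix with $\mathbf{\Delta}_{s,t} = 1$ when key $t$ maps to code $s$, so that $\hat{\mathbf{K}} = \mathbf{\Delta}^{\top}\mathbf{C}$, and it uses $\exp$ as an illustrative element-wise $\phi_{w}$. Under those assumptions, $\phi_{w}(\mathbf{Q}\hat{\mathbf{K}}^{\top})\mathbf{V} = \phi_{w}(\mathbf{Q}\mathbf{C}^{\top})(\mathbf{\Delta}\mathbf{V})$, which replaces a $T \times T$ weight matrix with a $T \times S$ one. The helper names and the random toy data are hypothetical, and the sketch does not model the softmax normalization, causal masking, or cache used by Transformer-VQ itself.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def phi(m):
    """Element-wise weight function; exp is an illustrative choice."""
    return [[math.exp(x) for x in row] for row in m]


random.seed(0)
T, S, D = 6, 3, 2  # sequence length, codebook size, head dimension (toy values)

Q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]  # queries, T x D
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]  # values,  T x D
C = [[random.gauss(0, 1) for _ in range(D)] for _ in range(S)]  # codebook, S x D
codes = [random.randrange(S) for _ in range(T)]  # assumed code index of each key

# Quantized keys: each row is the codeword assigned to that position.
K_hat = [C[s] for s in codes]

# Quadratic-time route: build the full T x T weight matrix, then apply it to V.
W = phi(matmul(Q, transpose(K_hat)))
O_quadratic = matmul(W, V)

# Factorized route: one-hot assignment matrix Delta (S x T), grouped sums of
# values per code (S x D), then only T x S unique attention scores.
Delta = [[1.0 if codes[t] == s else 0.0 for t in range(T)] for s in range(S)]
grouped_V = matmul(Delta, V)
O_factorized = matmul(phi(matmul(Q, transpose(C))), grouped_V)

# Largest absolute difference between the two outputs.
max_gap = max(
    abs(a - b)
    for row_q, row_f in zip(O_quadratic, O_factorized)
    for a, b in zip(row_q, row_f)
)
```

The two routes agree up to floating-point rounding. The factorized route costs on the order of $TSD$ operations rather than $T^{2}D$, which is linear in $T$ for a fixed codebook size $S$.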

Theorems & Definitions (24)

Definition 2.1
Theorem 2.2: Based on RGuo2019
Corollary 2.3
Corollary 2.4
Remark 2.5
Definition 2.6: Based on vanDenOord2017
Remark 2.7
Remark 2.8
Definition 3.1
Remark 3.2
...and 14 more
